Here we test and extend in various ways the two hypotheses due to Angela Brown and Delwin Lindsey (Brown & Lindsey, 2004; Lindsey & Brown, 2002) linking, across populations and languages, the color lexicon to the incidence of UVR (ultraviolet radiation types A and B, 280–400 nm; https://en.wikipedia.org/wiki/Ultraviolet) and to the frequency of abnormal color perception.
These hypotheses can be represented graphically as shown in the figure below:
We use the following input files:
| Input file | Format | Description |
|---|---|---|
| Database_final4.csv | CSV file (commas + quotes) | database with all variables’ information except distance matrices |
| distmat_gen_add.csv | CSV file (commas + quotes) | distance matrix obtained with the Alfred distance, where missing values have been replaced with an additive method |
| distmat_gen_ult.csv | CSV file (commas + quotes) | distance matrix obtained with the Alfred distance, where missing values have been replaced with an ultrametric method |
| distmat_gen_na.csv | CSV file (commas + quotes) | distance matrix obtained with the Alfred distance, with missing values kept |
| qlc_data_dist.csv | CSV file (commas + quotes) | distance matrix obtained with the phylogenetic distance on Glottolog |
| qlc2glottocode.csv | CSV file (commas + quotes) | table converting the output names from Glottolog to glottocodes (used with qlc_data_dist.csv) |
In order to investigate our hypotheses, we analyzed a set of 142 populations dispersed across all continents. This dataset was assembled from three sources:
Each population is identified by the glottocode (Hammarström, Bank, Forkel, & Haspelmath, 2018) of the (primary) language it speaks, as well as by its latitude, longitude and macroarea (also from Glottolog).
For the statistical analyses, longitude was transformed as described in Data transformation.
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| -16.39 | 24.91 | 50 | 88.86 | 118.2 | 312.4 |
Figure 1. Histogram of longitude.
Figure 2. Map of populations in our sample highlighting their longitude (color).
For the statistical analyses, latitude was transformed as described in Data transformation.
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| -38.29 | 4.739 | 23.34 | 22.97 | 42.87 | 69.38 |
Figure 3. Histogram of latitude.
Figure 4. Map of populations in our sample highlighting their latitude (color).
| Africa | Australia | Eurasia | North America | Papunesia | South America |
|---|---|---|---|---|---|
| 31 | 2 | 79 | 9 | 9 | 12 |
Figure 5. Map of populations in our sample highlighting their macroarea (color).
In their previous investigations, Brown & Lindsey (2004) and Meeussen (2015) gathered information about color vocabularies according to the definition of Basic Color Categories from Berlin & Kay (1991). For this research, we selected only two variables from these linguistic data: the size of the color lexicon, and the presence of a specific term for the color blue. The number of color categories (variable Nb_color_cat) varies between 2 and 12, whereas the variable Blue can only take the values yes or no.
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 2 | 5 | 7 | 7.662 | 10.75 | 12 |
Figure 6. Number of color categories.
Figure 7. Map of color categories (color scale).
| no | yes |
|---|---|
| 60 | 82 |
Figure 8. Words for blue.
Figure 9. Map of words for blue (color scale).
Data on abnormal red-green color perception for males was collected from 85 references, using the Ishihara test (Ishihara, 1917), the anomaloscope (Patent No. US3382025A, 1968), the Holmgren-Thomson wool test (Thomson, 1880) or the Hardy-Rand-Rittler pseudoisochromatic plate test (Hardy, Rand, & Rittler, 1954).
These data represent the percentage of red-green “color blind” males in the population, and vary between 0% and 11%. Overall “color blindness” rates were used but, when specific information was available, the rate exclusively measures deuteranopia, deuteranomaly, protanopia and protanomaly. Data on tritanopia was thus not included in this variable, since it implies abnormal color perception in the yellow-blue range. Samples where no subdivision between males and females was made, and samples containing data for fewer than 50 people, were excluded from this analysis.
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 0 | 1.87 | 3.268 | 3.88 | 5.652 | 10.68 |
Figure 10. Incidence of daltonism.
Figure 11. Map of incidence of daltonism (color scale).
The UVR incidence was calculated from the data available from the NASA Total Ozone Mapping Spectrometer (TOMS) at http://toms.gsfc.nasa.gov/ery_uv/new_uv/ for the whole year 1998. This variable measures the UVR received by the human body, in joules per square meter (J/m2), and takes into account the thickness of the ozone layer in the stratosphere, the amount of cloud cover, the elevation, and the height of the sun in the sky. Please note that, for the statistical analyses, this variable was z-scored.
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 5.683 | 15.45 | 25.12 | 21.71 | 27.9 | 30.58 |
Figure 12. Distribution of UVR.
Figure 13. Map of incidence of UVR (color scale).
We also included here several additional variables that may affect the amount of UVR received by the eye and its effects, the color lexicon, or the color blindness rate.
Building on Bentz, Dediu, Verkerk, & Jäger (2018), historical data on global weather and climate were obtained from WorldClim, and 19 variables were extracted for the period 1960-1990, including various measures such as temperature, seasonality or precipitation (see Bentz et al., 2018). A Principal Component Analysis (PCA) found that the first two principal components (PCs) explain most of the variance and have meaningful interpretations: PC1 explains 48.6% of the variance and reflects a wet, hot climate with low seasonality, whereas PC2 explains 24.1% of the variance and reflects a dry, hot climate with high seasonality (see Bentz et al., 2018 for details).
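The PCA step can be sketched as follows; the matrix clim below is a random stand-in for the actual 19 WorldClim variables, so the variance proportions it yields are not the 48.6% and 24.1% reported above.

```r
# Sketch of the PCA on the climate variables; `clim` is simulated toy data,
# not the actual WorldClim measurements.
set.seed(1)
clim <- matrix(rnorm(142 * 19), ncol = 19)   # 142 populations x 19 variables
pca  <- prcomp(clim, scale. = TRUE)          # PCA on standardized variables
pv   <- summary(pca)$importance["Proportion of Variance", 1:2]
pv   # proportions of variance explained by PC1 and PC2
```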
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| -10.06 | -4.672 | -2.032 | -2.071 | 0.08 | 5.195 |
Figure 14. Distribution of climate PC1.
Figure 15. Map of climate PC1 (color scale).
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| -6.604 | -2.464 | -0.252 | -0.3692 | 1.572 | 5.03 |
Figure 16. Distribution of climate PC2.
Figure 17. Map of climate PC2 (color scale).
Altitude data was obtained from Bentz et al. (2018); please note that for the analyses we use log(altitude), so that populations at an altitude of 0 yield -Inf.
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| -Inf | 4.636 | 5.725 | -Inf | 6.732 | 8.337 |
Figure 18. Distribution of log(altitude).
Figure 19. Map of log(altitude) (color scale).
Distances to lakes, oceans, rivers, and water in general were also obtained from Bentz et al. (2018); they were computed using an OpenStreetMap raster file, with distance to water being the minimum distance to any body of water.
Please note that for the analyses, these distances are log’ed.
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| -0.451 | 1.636 | 2.679 | 2.618 | 3.362 | 5.687 |
Figure 20. Distribution of log distance to lakes.
Figure 21. Map of log distance to lakes (color scale).
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 0.2546 | 2.543 | 3.443 | 3.39 | 4.078 | 7.504 |
Figure 22. Distribution of log distance to rivers.
Figure 23. Map of log distance to rivers (color scale).
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 1.223 | 4.007 | 5.133 | 4.968 | 6.252 | 7.104 |
Figure 24. Distribution of log distance to oceans.
Figure 25. Map of log distance to oceans (color scale).
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| -0.451 | 1.545 | 2.252 | 2.361 | 3.181 | 5.474 |
Figure 26. Distribution of log distance to water.
Figure 27. Map of log distance to water (color scale).
Subsistence strategy was obtained from AUTOTYP (Bickel et al., 2017) as coded in Blasi et al. (2019), and is represented by the binary variable hg: “yes” represents a population whose subsistence mode is based on hunting, fishing, gathering and/or foraging, whereas “no” means that the subsistence mode of the population is centered around food production. As these data did not cover all the populations in our dataset, we expanded them by looking at other databases, such as D-Place (Kirby et al., 2016) and Seshat (Turchin et al., 2015).
| no | yes |
|---|---|
| 128 | 14 |
Figure 28. Subsistence strategy.
Figure 29. Map of subsistence strategy (hunting-gathering).
Population size was obtained from Bentz et al. (2018), and we log’ed it.
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 0 | 11.01 | 14.46 | 13.78 | 16.73 | 24.25 |
Figure 30. Log population size.
Figure 31. Map of log population size.
Language family was obtained from Bentz et al. (2018); there are 32 unique language families, most languages belonging to the Indo-European, Afro-Asiatic and Uralic families. The putative geographic origins of the language families (latitude and longitude) were obtained from Wichmann, Müller, & Velupillai (2010), supplemented with information from Glottolog (Hammarström et al., 2018).
In the map, only the 15 most represented language families are shown individually; the other families are gathered under “Other”.
| abkh1242 | afro1255 | ainu1252 | araw1281 | atha1245 | atla1278 | aust1305 |
|---|---|---|---|---|---|---|
| 1 | 13 | 1 | 1 | 2 | 19 | 3 |
| aust1307 | ayma1253 | basq1248 | chib1249 | drav1251 | eski1264 | hadz1240 |
|---|---|---|---|---|---|---|
| 9 | 1 | 1 | 1 | 2 | 3 | 1 |
| indo1319 | japo1237 | jiva1245 | kore1284 | maya1287 | nilo1247 | nucl1710 |
|---|---|---|---|---|---|---|
| 41 | 1 | 1 | 1 | 4 | 3 | 3 |
| otom1299 | pama1250 | pano1259 | sino1245 | taik1256 | ticu1244 | tupi1275 |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 9 | 1 | 1 | 1 |
| turk1311 | ural1272 | utoa1244 | yano1268 |
|---|---|---|---|
| 4 | 9 | 1 | 1 |
Figure 32. Language families. Only the 15 most represented language families are explicitly shown, the others being gathered in the umbrella category ‘Other’.
Figure 33. Map of main language families. Only the 15 most represented language families are explicitly shown, the others being gathered in the umbrella category ‘Other’.
Figure 34. Map of the putative origins of language families.
Unfortunately, there is apparently a lack of information concerning cross-population variation in the opsin genes (OPN1MW and OPN1LW). However, using a set of 96 micro-haplotypes and 382 SNP variants available from the ALFRED database (Rajeevan, 2003), we computed the overall genetic distance between populations with Arlequin (Excoffier & Lischer, 2010). Missing values in the final distance matrix were estimated using ultrametric (Lapointe & Kirsch, 1995) and additive (De Soete, 1984) data imputation methods (see Appendix).
Phylogenetic data was obtained from Glottolog (Hammarström et al., 2018), and a distance matrix between our focus populations was created using qlcData and cophenetic in R (R Core Team, 2020). For more details, please see the section about Point Pattern Analysis.
Finally, we applied classic multi-dimensional scaling (cmdscale) to the previously computed distance matrix, and selected the minimum number of principal coordinates resulting in a goodness-of-fit of at least 50%.
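This selection step can be sketched as follows; d below is a toy distance matrix standing in for the genetic distance matrix described above.

```r
# Sketch: classical MDS, keeping the minimum number of principal coordinates
# whose goodness-of-fit reaches at least 50%; `d` is simulated toy data.
set.seed(1)
d <- dist(matrix(rnorm(200), nrow = 20))
for (k in 1:(attr(d, "Size") - 1)) {
  mds <- cmdscale(d, k = k, eig = TRUE)
  if (mds$GOF[1] >= 0.5) break   # smallest k with GOF >= 50%
}
k                 # number of principal coordinates retained
head(mds$points)  # coordinates used in subsequent analyses
```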
For some statistical analyses (including regression models), we performed the following data transformations:
- log was applied to distance to water, oceans, rivers, lakes, altitude, and population size, as their ranges on the original measurement scales were very wide, with a few extreme outliers;
- cos(X*(pi/180)) was applied to latitude: these new values capture the closeness to the equator, with 1.0 representing a location on the equator, and 0.0 representing a location at one of the poles (the actual variation of our data was between 0.4 and 1.0).

Thus, a total of 19 variables (excluding the genetic multi-dimensional scaling dimensions) have been used for the analysis:
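These transformations can be sketched as below; the data frame db and its column names are hypothetical stand-ins for our actual variables.

```r
# Hypothetical toy data; column names are illustrative, not the actual ones.
db <- data.frame(dist_water = c(0.5, 12, 310),        # km
                 latitude   = c(-38.3, 23.3, 69.4),   # degrees
                 UVR        = c(5.7, 25.1, 30.6))
db$log_dist_water <- log(db$dist_water)               # log for wide-ranged variables
db$eq_closeness   <- cos(db$latitude * (pi / 180))    # 1 at the equator, 0 at the poles
db$UVR_z          <- as.numeric(scale(db$UVR))        # z-scored UVR
db
```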
Five distance matrices were also collected among the populations (see Appendix and Genetic distance from Alfred (FST) for more info):
We performed a Point Pattern Analysis (PPA) of our datapoints. PPA is a set of methods used for the detection of patterns and spatial arrangements and structures in a set of map locations (Spielman, 2017). Our dataset can be considered as a relatively random sample of populations, as it was driven by data availability for the color vocabulary and daltonism rate.
Our datapoints represent populations, and their locations are defined by longitude and latitude on the Earth’s surface. These points have marks attached to them, and we focus here on two such marks: a categorical mark (Blue) and a continuous mark (daltonism). Covariates, such as elevation or climate, are a type of data treated as explanatory rather than as part of the a priori hypotheses, and they have been used in previous studies of linguistic diversity (e.g., Bentz et al., 2018; Everett, 2013). Here, we used statistical methods traditionally used in PPA, such as comparison with a Poisson process, modelling and spatial autocorrelation.
We also explored neighbouring coefficients and various distance matrices. The neighbouring coefficient describes the degree of similarity that populations share with their neighbours for different attributes, estimated with correlograms using Pearson correlations between the focus populations and their neighbours. Here, the neighbours were defined using various distance matrices, and we averaged the correlation coefficients across these distance matrices.
Here we list various resources that we used for these analyses:
We used two different data formats: class ppp and class sp:
- ppp, representing point pattern datasets in the two-dimensional plane: one without marks, one with Blue as a mark, and one with daltonism as a mark;
- sp with marks:

class : SpatialPointsDataFrame
features : 142
extent : -176.207, 178.33, -38.2881, 69.3761 (xmin, xmax, ymin, ymax)
crs : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
variables : 2
names : dalto, Blue
min values : 0, no
max values : 10.68, yes
ppp with no marks:

Planar point pattern: 142 points
window: rectangle = [-16.3916, 312.4226] x [-38.2881, 69.3761] units

ppp with Blue as mark:

Marked planar point pattern: 142 points
Multitype, with levels = no, yes
window: rectangle = [-16.3916, 312.4226] x [-38.2881, 69.3761] units

ppp with daltonism as mark:

Marked planar point pattern: 142 points
marks are numeric, of storage type 'double'
window: rectangle = [-16.3916, 312.4226] x [-38.2881, 69.3761] units
Figure 35. Density surfaces for languages without a specific term for Blue.
Figure 36. Density surfaces for languages with a specific term for Blue.
We can see in the plots above that the densities of Blue=“no” and Blue=“yes” are different: populations having a specific term for Blue are mostly clustered in Europe, whereas populations lacking a word for Blue are mostly found in Africa, South America, the Middle East and South Asia.
Figure 37. Pairwise density surfaces for Blue ‘yes’ and ‘no’.
This graph includes two density surfaces interacting with each other: the density surface for the “yes” condition and the density surface for the “no” condition. In other terms, it computes the smoothed kernel intensity estimate for each condition (yes and no) and displays scatterplots of each condition pair.
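A minimal sketch of such per-mark density surfaces with spatstat, using a simulated multitype pattern instead of our actual data:

```r
library(spatstat)
set.seed(1)
# Toy multitype pattern standing in for our Blue = yes/no data:
pp <- rpoispp(100)
marks(pp) <- factor(sample(c("no", "yes"), npoints(pp), replace = TRUE))
dens <- density(split(pp))  # one smoothed kernel intensity surface per mark level
plot(dens)                  # side-by-side density surfaces
```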
We used 5 different distance matrices and 1 neighbouring matrix, constructed using different packages:
Visualize only first 6 rows and columns:
| abua1244 | acol1236 | ainu1240 | alba1267 | alge1239 | amha1245 | |
|---|---|---|---|---|---|---|
| abua1244 | 0 | 2879 | 13060 | 4223 | 3405 | 3703 |
| acol1236 | 2879 | 0 | 11321 | 4330 | 4639 | 1187 |
| ainu1240 | 13060 | 11321 | 0 | 9010 | 10329 | 10137 |
| alba1267 | 4223 | 4330 | 9010 | 0 | 1591 | 3767 |
| alge1239 | 3405 | 4639 | 10329 | 1591 | 0 | 4500 |
| amha1245 | 3703 | 1187 | 10137 | 3767 | 4500 | 0 |
This is based on a method developed in Cysouw, Dediu, & Moran (2012) and adapted by Marc Tang: it computes neighbouring datapoints by taking into account the natural barriers to population movement represented by seas and oceans (see figures below).
nerror = 4
Increasing madj from 176 to 212 and trying again.
nerror = 4
Increasing madj from 212 to 255 and trying again.
Let’s visualize the neighbouring matrix for the first 6 rows:
| id | glottocode | long | lat | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | abua1244 | 6.615 | 4.831 | 49 | 99 | 139 | NA | NA | NA | NA | NA | NA | NA | NA |
| 2 | acol1236 | 32.51 | 3.577 | 13 | 55 | 63 | 73 | 74 | 96 | NA | NA | NA | NA | NA |
| 3 | ainu1240 | 142.5 | 43.63 | 24 | 56 | 68 | 91 | 141 | NA | NA | NA | NA | NA | NA |
| 4 | alba1267 | 20 | 41 | 21 | 44 | 51 | 54 | 110 | NA | NA | NA | NA | NA | NA |
| 5 | alge1239 | 3.23 | 35.42 | 49 | 76 | 87 | 122 | NA | NA | NA | NA | NA | NA | NA |
| 6 | amha1245 | 39.54 | 11.71 | 13 | 63 | 119 | NA | NA | NA | NA | NA | NA | NA | NA |
Genetic distances were estimated using FST on data from the ALFRED database (see method in Appendix). Because this genetic distance matrix has many missing values, two imputation methods were used (in addition to using the distance matrix as is): additive and ultrametric.
Let’s visualize the additive matrix for the first 6 rows and columns:
| kaba1278 | krio1252 | mesc1238 | alge1239 | nyun1247 | gusi1247 | |
|---|---|---|---|---|---|---|
| kaba1278 | 0 | 0.6572 | 0.4678 | 0.2358 | 0.2357 | 0.513 |
| krio1252 | 0.6572 | 0 | -0.3712 | -0.7722 | -0.9272 | -0.3804 |
| mesc1238 | 0.4678 | -0.3712 | 0 | -0.8432 | -1.005 | 0.5988 |
| alge1239 | 0.2358 | -0.7722 | -0.8432 | 0 | 0.2555 | -0.8523 |
| nyun1247 | 0.2357 | -0.9272 | -1.005 | 0.2555 | 0 | -1.015 |
| gusi1247 | 0.513 | -0.3804 | 0.5988 | -0.8523 | -1.015 | 0 |
Let’s visualize the ultrametric matrix for the first 6 rows and columns:
| kaba1278 | krio1252 | mesc1238 | alge1239 | nyun1247 | gusi1247 | |
|---|---|---|---|---|---|---|
| kaba1278 | 0 | 0.6572 | 0.4678 | 0.2358 | 0.2357 | 0.513 |
| krio1252 | 0.6572 | 0 | 0.05766 | 0.03598 | 0.04613 | 0.00367 |
| mesc1238 | 0.4678 | 0.05766 | 0 | 0.05766 | 0.05766 | 0.5988 |
| alge1239 | 0.2358 | 0.03598 | 0.05766 | 0 | 0.2555 | 0.03598 |
| nyun1247 | 0.2357 | 0.04613 | 0.05766 | 0.2555 | 0 | 0.04613 |
| gusi1247 | 0.513 | 0.00367 | 0.5988 | 0.03598 | 0.04613 | 0 |
Let’s visualize the matrix without data imputation, for the first 6 rows and columns:
| kaba1278 | krio1252 | mesc1238 | alge1239 | nyun1247 | gusi1247 | |
|---|---|---|---|---|---|---|
| kaba1278 | 0 | 0.6572 | 0.4678 | 0.2358 | 0.2357 | 0.513 |
| krio1252 | 0.6572 | 0 | NA | NA | NA | NA |
| mesc1238 | 0.4678 | NA | 0 | NA | NA | 0.5988 |
| alge1239 | 0.2358 | NA | NA | 0 | 0.2555 | NA |
| nyun1247 | 0.2357 | NA | NA | 0.2555 | 0 | NA |
| gusi1247 | 0.513 | NA | 0.5988 | NA | NA | 0 |
The table qlc_data_dist.csv was computed using the qlcData package and the cophenetic function.
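The cophenetic step can be illustrated as follows, using ape and a toy Newick tree instead of the actual Glottolog trees (tip names and branch lengths are hypothetical):

```r
library(ape)
# Toy tree in Newick format; in the actual analysis the tree comes from
# Glottolog via qlcData.
tr <- read.tree(text = "((lang_A:1,lang_B:1):2,(lang_C:2,lang_D:2):1);")
pd <- cophenetic(tr)  # pairwise patristic distances between the tips
pd
```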
Let’s visualize the phylogenetic matrix for the first 6 rows and columns:
| pare1266 | carn1240 | viet1252 | fiji1243 | wall1257 | samo1305 | |
|---|---|---|---|---|---|---|
| pare1266 | 0 | 150.6 | 150.6 | 188.2 | 188.2 | 188.2 |
| carn1240 | 150.6 | 0 | 150.6 | 188.2 | 188.2 | 188.2 |
| viet1252 | 150.6 | 150.6 | 0 | 188.2 | 188.2 | 188.2 |
| fiji1243 | 188.2 | 188.2 | 188.2 | 0 | 94.12 | 94.12 |
| wall1257 | 188.2 | 188.2 | 188.2 | 94.12 | 0 | 70.59 |
| samo1305 | 188.2 | 188.2 | 188.2 | 94.12 | 70.59 | 0 |
For brms, to select the best model, we used Bayes factors, WAIC, LOO and KFOLD. Please note that, for brms, there might be differences between Bayes factors, on the one hand, and WAIC/LOO/KFOLD, on the other, due to the default use of improper priors (see, for example, https://stats.stackexchange.com/questions/407964/bayes-factors-and-predictive-accuracy-in-model-comparison-in-rstan-brms); therefore, we will use both approaches for model selection.
Under the assumption of CSR (complete spatial randomness), points are independent of each other and have the same likelihood of being found at any location. These assumptions are almost never met: instead, the location of points is driven by point attributes (marks), other covariates, and random noise.
Poisson is a discrete probability distribution; let’s visualize it:
We can think of our null hypothesis for the spatial distribution of our points as one of complete spatial randomness (CSR). By comparing our data to this null of randomness, we can determine whether our point process is significantly departing from spatial randomness by being either clustered or regularly spaced.
The “classic method” is to compare our data to a CSR process using a χ2 test based on quadrat counts, or a Monte-Carlo test.
| Test statistic | df | P value | Alternative hypothesis |
|---|---|---|---|
| 401 | 74 | 4.202e-46 * * * | two.sided |
| Test statistic | P value | Alternative hypothesis |
|---|---|---|
| 401 | 0.001 * * * | two.sided |
We can see that both produce very low p-values, so we can reject the null hypothesis of CSR.
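A minimal sketch of these two tests with spatstat, on a simulated CSR pattern rather than our actual data (for which, as shown above, CSR is rejected):

```r
library(spatstat)
set.seed(1)
pp  <- rpoispp(100)  # toy homogeneous Poisson pattern on the unit square
qt  <- quadrat.test(pp)                        # chi-squared test on quadrat counts
qtm <- quadrat.test(pp, method = "MonteCarlo") # Monte-Carlo version
c(qt$p.value, qtm$p.value)
```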
The Kolmogorov-Smirnov test compares our empirical (sampled) data to a theoretical (in our case, CSR) ideal, to see whether the two are significantly different. With point data, we specify a real function \(T(x,y)\) defined at all locations \(x,y\) in our sampling window, evaluate it at each of the data points, and compare the empirical values of \(T\) with the predicted distribution of \(T\) under the CSR assumptions.
Figure 38. Kolmogorov-Smirnov versus the CSR using only the ‘x’ coordinate.
We can also input a density function as a covariate, to estimate overall spatial randomness:
Figure 39. Kolmogorov-Smirnov versus the CSR using two dimensions.
We can see that the observed and expected (if our data were CSR) cumulative distributions are quite different.
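The coordinate-based Kolmogorov-Smirnov comparison can be sketched with spatstat as follows (simulated pattern standing in for our data):

```r
library(spatstat)
set.seed(1)
pp <- rpoispp(100)        # toy CSR pattern
ks <- cdf.test(pp, "x")   # KS test using the x coordinate as covariate
ks$p.value
```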
Thus, our tests of spatial randomness compared the distribution of our datapoints to a Poisson process, where points have the same likelihood of being found at any location. Both the χ2 test and the Monte-Carlo test suggest that the distribution of our datapoints is significantly different from a Poisson process, which is also supported by the Kolmogorov-Smirnov test. In a way, these results confirm the obvious point that our data are not randomly distributed over the whole surface of the Earth!
Our point process is not random, but in what way? Is it because the datapoints are regularly spaced or clustered?
We will compare the distances between datapoints; these distances can be of several kinds:
Function value object (class 'fv')
for the function r -> G(r)
.....................................................................
Math.label Description
r r distance argument r
theo G[pois](r) theoretical Poisson G(r)
han hat(G)[han](r) Hanisch estimate of G(r)
rs hat(G)[bord](r) border corrected estimate of G(r)
km hat(G)[km](r) Kaplan-Meier estimate of G(r)
hazard hat(h)[km](r) Kaplan-Meier estimate of hazard function h(r)
theohaz h[pois](r) theoretical Poisson hazard function h(r)
.....................................................................
Default plot formula: .~r
where "." stands for 'km', 'rs', 'han', 'theo'
Recommended range of argument r: [0, 8.324]
Available range of argument r: [0, 30.226]
Figure 40. The G function.
For all values of r (the distances between points), our empirical function is greater than the theoretical one: nearest-neighbour distances among our data points are shorter than for a Poisson process, suggesting a clustered pattern.
Function value object (class 'fv')
for the function r -> F(r)
.....................................................................
Math.label Description
r r distance argument r
theo F[pois](r) theoretical Poisson F(r)
cs hat(F)[cs](r) Chiu-Stoyan estimate of F(r)
rs hat(F)[bord](r) border corrected estimate of F(r)
km hat(F)[km](r) Kaplan-Meier estimate of F(r)
hazard hat(h)[km](r) Kaplan-Meier estimate of hazard function h(r)
theohaz h[pois](r) theoretical Poisson hazard h(r)
.....................................................................
Default plot formula: .~r
where "." stands for 'km', 'rs', 'cs', 'theo'
Recommended range of argument r: [0, 30.281]
Available range of argument r: [0, 30.281]
Figure 41. The F function.
Empirical values above the theoretical values (blue) indicate that empty space distances in our empirical point pattern are shorter than for a Poisson process, suggesting again a clustered pattern.
Function value object (class 'fv')
for the function r -> K(r)
..............................................................
Math.label Description
r r distance argument r
theo K[pois](r) theoretical Poisson K(r)
border hat(K)[bord](r) border-corrected estimate of K(r)
trans hat(K)[trans](r) translation-corrected estimate of K(r)
iso hat(K)[iso](r) isotropic-corrected estimate of K(r)
..............................................................
Default plot formula: .~r
where "." stands for 'iso', 'trans', 'border', 'theo'
Recommended range of argument r: [0, 26.916]
Available range of argument r: [0, 26.916]
Figure 42. The K function.
This test confirms the previous results: our data seems to be clustered.
Thus, the G function (nearest neighbour distance), F function (empty space distance) and K function (pairwise distance) plots all show a clustered pattern.
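These three summary functions can be computed with spatstat as sketched below (simulated pattern standing in for our data):

```r
library(spatstat)
set.seed(1)
pp <- rpoispp(100)  # toy pattern; our actual data show clustering
Gf <- Gest(pp)      # G: nearest-neighbour distance function
Ff <- Fest(pp)      # F: empty-space function
Kf <- Kest(pp)      # K: pairwise-distance (Ripley's K) function
plot(Gf); plot(Ff); plot(Kf)
```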
We test here whether elevation is driving the location of our datapoints; we fit two models for our point process: one with a homogeneous Poisson process, and one with a Poisson process with elevation as a covariate.
Figure 43. Elevation as a covariate.
We use the function ppm (point process model) to fit a model (this is analogous to the model fitting functions in R such as lm and glm). The statistic \(S(u)\) is specified by an R language formula, like the formulas used to specify the systematic relationship in a linear model or generalised linear model.
Nonstationary Poisson process
Log intensity: ~side(x)
Fitted trend coefficients:
(Intercept) side(x)right
-4.8026460 -0.2792705
Estimate S.E. CI95.lo CI95.hi Ztest Zval
(Intercept) -4.8026460 0.02207554 -4.845913 -4.759379 *** -217.555092
side(x)right -0.2792705 0.03364014 -0.345204 -0.213337 *** -8.301703
We plot the predictions of the fitted models:
Figure 44. Poisson model with elevation as a covariate.
Figure 45. Poisson model fitted with intensity that is proportional to elevation.
The summary of the model with elevation as a covariate:
Point process model
Fitting method: maximum likelihood (Berman-Turner approximation)
Model was fitted using glm()
Algorithm converged
Call:
ppm.ppp(Q = mydatappp2, trend = ~cov1, covariates = list(cov1 = ele))
Edge correction: "border"
[border correction distance r = 0 ]
--------------------------------------------------------------------------------
Quadrature scheme (Berman-Turner) = data + dummy + weights
Data pattern:
Planar point pattern: 142 points
Average intensity 0.00372 points per square unit
Window: rectangle = [-176.207, 178.33] x [-38.2881, 69.3761] units
(354.5 x 107.7 units)
Window area = 38170.9 square units
Dummy quadrature points:
32 x 32 grid of dummy points, plus 4 corner points
dummy spacing: 11.079281 x 3.364506 units
Original dummy parameters: =
Planar point pattern: 1028 points
Average intensity 0.0269 points per square unit
Window: rectangle = [-176.207, 178.33] x [-38.2881, 69.3761] units
(354.5 x 107.7 units)
Window area = 38170.9 square units
Quadrature weights:
(counting weights based on 32 x 32 array of rectangular tiles)
All weights:
range: [5.33, 37.3] total: 38200
Weights on data points:
range: [5.33, 18.6] total: 2030
Weights on dummy points:
range: [5.33, 37.3] total: 36100
--------------------------------------------------------------------------------
FITTED MODEL:
Nonstationary Poisson process
---- Intensity: ----
Log intensity: ~cov1
Model depends on external covariate 'cov1'
Covariates provided:
cov1: im
Fitted trend coefficients:
(Intercept) cov1
-5.269965833 0.000493119
Estimate S.E. CI95.lo CI95.hi Ztest
(Intercept) -5.269965833 8.704763e-02 -5.4405760603 -5.0993556058 ***
cov1 0.000493119 4.080347e-05 0.0004131457 0.0005730923 ***
Zval
(Intercept) -60.54117
cov1 12.08522
----------- gory details -----
Fitted regular parameters (theta):
(Intercept) cov1
-5.269965833 0.000493119
Fitted exp(theta):
(Intercept) cov1
0.005143786 1.000493241
We compared the models using the likelihood ratio test of the null hypothesis of a homogeneous Poisson process (CSR) against the alternative of an inhomogeneous Poisson process with intensity that is a function of the elevation covariate.
| Npar | Df | Deviance | Pr(>Chi) |
|---|---|---|---|
| 1 | NA | NA | NA |
| 2 | 1 | 171.4 | 3.629e-39 |
The p-value is very small, rejecting CSR in favour of the alternative model.
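This comparison can be sketched as follows with spatstat, using a simulated inhomogeneous pattern and the x coordinate in place of elevation:

```r
library(spatstat)
set.seed(1)
# Toy pattern whose intensity increases with x:
pp   <- rpoispp(function(x, y) { 200 * x })
fit0 <- ppm(pp ~ 1)  # homogeneous Poisson process (CSR)
fit1 <- ppm(pp ~ x)  # intensity as a log-linear function of the covariate
an   <- anova(fit0, fit1, test = "LRT")  # likelihood ratio test
an
```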
Spatial autocorrelation describes the degree to which variables are similar to each other at different spatial locations. We computed Moran’s index (Moran, 1950), a measure of global spatial autocorrelation, using different distance matrices.
We will use 3 different methods to compute spatial autocorrelation with Moran’s index:
Moran’s I for daltonism:
$observed
[1] 0.1760609
$expected
[1] -0.007092199
$sd
[1] 0.02082568
$p.value
[1] 0
Moran’s I for Blue:
$observed
[1] 0.2024622
$expected
[1] -0.007092199
$sd
[1] 0.02093014
$p.value
[1] 0
Thus, there is a significant positive spatial autocorrelation for both Blue and daltonism.
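A minimal sketch of this computation with ape’s Moran.I, whose output has the $observed/$expected/$sd/$p.value structure shown above (toy coordinates and values standing in for our data):

```r
library(ape)
set.seed(1)
# Toy data standing in for population coordinates and daltonism values:
coords <- matrix(runif(60), ncol = 2)
vals   <- runif(30)
w <- 1 / as.matrix(dist(coords))  # inverse-distance weight matrix
diag(w) <- 0                      # no self-neighbouring
m <- Moran.I(vals, w)
m
```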
Instead of using the whole distance matrix, we will use bins of distance. For example, populations living close enough (within a certain distance range, such as between 0 and 1000 km away) will be considered as being close, the others as being far. The information conveyed in the distance matrix is thus converted into binary information.
We will use 2 different ranges:
Thus, the average number of neighbours per population increases when the range is larger.
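The binning step can be sketched as follows (toy coordinates; distmat is a hypothetical distance matrix in km):

```r
# Sketch: converting a distance matrix into a binary neighbour matrix
# for a given range (here 1000 km); the data are simulated.
set.seed(1)
coords  <- matrix(runif(20, 0, 5000), ncol = 2)
distmat <- as.matrix(dist(coords))
W <- (distmat > 0 & distmat <= 1000) * 1  # 1 = neighbour, 0 = not
rowSums(W)  # number of neighbours per population
```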
Moran’s I for daltonism, bins=2000 and bins=1000:
$observed
[1] 0.3764329
$expected
[1] -0.007092199
$sd
[1] 0.0445232
$p.value
[1] 0
$observed
[1] 0.4416968
$expected
[1] -0.007092199
$sd
[1] 0.06377459
$p.value
[1] 1.962652e-12
Moran’s I for Blue, bins=2000 and bins=1000:
$observed
[1] 0.3244207
$expected
[1] -0.007092199
$sd
[1] 0.04474737
$p.value
[1] 1.276756e-13
$observed
[1] 0.3694625
$expected
[1] -0.007092199
$sd
[1] 0.06409501
$p.value
[1] 4.229832e-09
We can see that the results with the 2 different bins are very similar, showing a significant positive autocorrelation, and are also very similar to the results from the inverse distance matrix.
For each language, we compute its nearest neighbours:
| id | glottocode | long | lat | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | abua1244 | 6.61492 | 4.83057 | 49 | 99 | 139 | NA | NA | NA | NA | NA | NA | NA | NA |
| 2 | acol1236 | 32.51470 | 3.57738 | 13 | 55 | 63 | 73 | 74 | 96 | NA | NA | NA | NA | NA |
| 3 | ainu1240 | 142.46167 | 43.63365 | 24 | 56 | 68 | 91 | 141 | NA | NA | NA | NA | NA | NA |
| 4 | alba1267 | 20.00000 | 41.00000 | 21 | 44 | 51 | 54 | 110 | NA | NA | NA | NA | NA | NA |
| 5 | alge1239 | 3.23033 | 35.42080 | 49 | 76 | 87 | 122 | NA | NA | NA | NA | NA | NA | NA |
| 6 | amha1245 | 39.54346 | 11.70818 | 13 | 63 | 119 | NA | NA | NA | NA | NA | NA | NA | NA |
Moran’s I for daltonism:
Moran I test under normality
data: dbdalto$daltonism
weights: gg2
Moran I statistic standard deviate = 7.478, p-value = 3.773e-14
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
0.414876086 -0.007092199 0.003184125
Moran’s I for Blue:
Moran I test under normality
data: as.numeric(dbdalto$Blue)
weights: gg2
Moran I statistic standard deviate = 5.5324, p-value = 1.58e-08
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
0.305088031 -0.007092199 0.003184125
Monte Carlo test for daltonism:
Monte-Carlo simulation of Moran I
data: dbdalto$daltonism
weights: gg2
number of simulations + 1: 100
statistic = 0.41488, observed rank = 100, p-value = 0.01
alternative hypothesis: greater
Monte Carlo test for Blue:
Monte-Carlo simulation of Moran I
data: as.numeric(dbdalto$Blue)
weights: gg2
number of simulations + 1: 100
statistic = 0.30509, observed rank = 100, p-value = 0.01
alternative hypothesis: greater
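The Monte Carlo test above permutes the values across locations and ranks the observed statistic among the simulated ones; with 99 simulations, an observed rank of 100 yields p = 1/100 = 0.01. A minimal Python sketch of the mechanics (illustrative; the analysis itself uses the R implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def moran_mc(x, w, nsim=99):
    """Monte-Carlo Moran test: recompute I on permuted values and rank
    the observed statistic among the simulated ones."""
    def stat(v):
        z = v - v.mean()
        return (len(v) / w.sum()) * (z @ w @ z) / (z @ z)
    x = np.asarray(x, dtype=float)
    obs = stat(x)
    sims = [stat(rng.permutation(x)) for _ in range(nsim)]
    # one-sided p-value: share of draws at least as extreme as the observed I
    p = (1 + sum(s >= obs for s in sims)) / (nsim + 1)
    return obs, p

# toy example: two homogeneous chains of three populations each
w_toy = np.zeros((6, 6))
for a, b in [(0, 1), (1, 2), (3, 4), (4, 5)]:
    w_toy[a, b] = w_toy[b, a] = 1.0
obs_toy, p_toy = moran_mc(np.array([1., 1., 1., 0., 0., 0.]), w_toy)
```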
UVR incidence and the number of color categories are two other variables of interest in our dataset; let's have a look at their spatial autocorrelation:
Moran’s I for UVR:
$observed
[1] 0.3912663
$expected
[1] -0.007092199
$sd
[1] 0.020865
$p.value
[1] 0
Moran’s I for number of color categories:
$observed
[1] 0.1733223
$expected
[1] -0.007092199
$sd
[1] 0.0208884
$p.value
[1] 0
Both variables also show a high positive spatial autocorrelation, especially UVR.
Multiple methods (inverse distance matrix, bins of distances, and Delaunay neighbours) emphasize the high degree of spatial autocorrelation for Blue and daltonism.
Here, we use the distance matrices explained above to compute the autocorrelations of the variables Blue and daltonism, as we aim to find which distance matrix results in the highest autocorrelations.
The number of neighbours is manually fixed at 10:
Steps used to compute the correlation coefficient (method 1):
Thus, the value for KNN=i represents the correlation between each focus population and only the values of the neighbours exactly i steps away.
We also computed (method 2) the correlation between each focus population and all its neighbours closer than KNN=i. For example, for KNN=5, the correlation is computed between the focus population and the mean of all its neighbours within the 5-step radius (KNN=1, 2, 3, 4 and 5). This method is probably more appropriate for our analysis, as we want to study neighbours in general.
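Method 2 can be sketched as follows (illustrative Python, with Euclidean k-nearest neighbours standing in for the actual neighbour graphs; the data are made up):

```python
import numpy as np

def neighbour_coefficient(values, dist, k):
    """Method 2: correlate each focus population's value with the mean
    value of its k nearest neighbours (self excluded)."""
    values = np.asarray(values, dtype=float)
    d = np.asarray(dist, dtype=float).copy()
    np.fill_diagonal(d, np.inf)  # a population is not its own neighbour
    neigh_means = np.empty(len(values))
    for i in range(len(values)):
        idx = np.argsort(d[i])[:k]  # indices of the k nearest neighbours
        neigh_means[i] = values[idx].mean()
    return np.corrcoef(values, neigh_means)[0, 1]

# two made-up clusters of populations on a line, with cluster-specific values
pos = np.array([0., 1., 2., 10., 11., 12.])
d = np.abs(pos[:, None] - pos[None, :])
vals = np.array([0., 0., 0., 1., 1., 1.])
r = neighbour_coefficient(vals, d, k=2)  # close to 1: strong autocorrelation
```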
Figure 46. Method 1, for daltonism.
Figure 47. Method 2, for daltonism.
We used a t-test to compare the means of the two autocorrelations:
| Test statistic | df | P value | Alternative hypothesis | mean of x | mean of y |
|---|---|---|---|---|---|
| 2.247 | 9.557 | 0.04964 * | two.sided | 0.5349 | 0.4982 |
We apply the same steps to the Blue variable (converted from yes/no into a numerical binary variable, 0/1).
Figure 48. Method 1, for blue
Figure 49. Method 2, for blue
| Test statistic | df | P value | Alternative hypothesis | mean of x | mean of y |
|---|---|---|---|---|---|
| 4.087 | 11.5 | 0.001641 * * | two.sided | 0.5468 | 0.4471 |
Thus, we found a high neighbouring coefficient for the Blue (r=0.62 at KNN=3 with the genetic distance matrix) and daltonism (r=0.57 at KNN=3 with the genetic distance matrix) variables, supporting the high spatial autocorrelation found for these variables.
Moreover, we compared the geographic and genetic distance matrices in terms of which generates the highest neighbouring coefficient. Interestingly, the variants of the genetic distance (no data imputation, additive imputation, ultrametric imputation) and the phylogenetic distance (from Glottolog) gave very different results. When the number of nearest neighbours is < 8, the genetic distance with ultrametric imputation best explained the patterning of our data points, for both Blue and daltonism; in particular, it resulted in significantly higher neighbouring coefficients than geographical distance. When the number of nearest neighbours is >= 9, all distance matrices tend to produce equivalent rates of similarity.
We first investigated if the number of color categories may predict variable Blue. Indeed, according to Berlin & Kay (1991), color categories are acquired through successive steps. Thus, a language possessing 6 color words will possess a word for blue, whereas a language having only 5 color terms will not. If this theory is correct, then the number of color categories should entirely predict the presence or absence of a specific blue term. However, several authors have noticed that it does not account for all languages.
In our dataset, only 60.56% of languages follow Berlin and Kay's stages. The distributions of the number of color categories for languages with and without a word for Blue overlap:
Figure 50. The distribution of the number of color categories for language with (blue) and without (red) a word for Blue.
In our data, 4.9% of the languages with less than 6 color words do have a word for blue, while 20.8% of languages with more than 6 color words do not have a word for blue.
Thus, for our data, while the number of color categories is indeed an important predictor of the presence of a word for blue (as predicted by Berlin and Kay’s stages theory), it does not, however, account for all the variation in our dataset.
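The percentages above come from comparing each language's observed Blue value with the one Berlin and Kay's stages predict (a word for blue iff at least 6 color categories). A toy Python sketch of this check (the data here are made up, not our dataset):

```python
import numpy as np

# made-up toy data: number of color categories and presence of a blue term
n_categories = np.array([3, 5, 6, 7, 11, 4, 8])
has_blue     = np.array([0, 1, 1, 1, 1,  0, 0])  # Blue recoded as 0/1

# Berlin & Kay's stages predict a word for blue iff >= 6 categories
predicted = (n_categories >= 6).astype(int)
follows_stages = (predicted == has_blue).mean()          # share consistent with B&K
blue_below6 = has_blue[n_categories < 6].mean()          # blue despite < 6 terms
no_blue_above6 = 1 - has_blue[n_categories >= 6].mean()  # no blue despite >= 6 terms
```

The 60.56%, 4.9% and 20.8% figures reported for our data are computed analogously.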
Second, we carried out a mediation analysis (using the brm function from brms package in R: Bürkner, 2018a and the mediation function from spatstat; Baddeley, Rubak, & Turner, 2015) to test whether the number of color categories is a mediating factor between UVR incidence and the presence of a dedicated word for blue, and whether the relation between the latter and latitude is mediated by UVR incidence.
Due to current limitations in the brm function, we cannot study the quadratic effect of latitude, only the linear one.
Interpretation of the mediation output:
the indirect effect (a*b), the direct effect (c'), and the total effect (c), which has been computed using c = c' + a*b.
Family: MV(gaussian, bernoulli)
Links: mu = identity; sigma = identity
mu = logit
Formula: Nb_color_cat ~ UVB_z + (1 | macroarea) + (1 | family_code)
Blue ~ UVB_z + Nb_color_cat + (1 | macroarea) + (1 | family_code)
Data: bNA (Number of observations: 142)
Samples: 4 chains, each with iter = 20000; warmup = 5000; thin = 5;
total post-warmup samples = 12000
Group-Level Effects:
~family_code (Number of levels: 32)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(Nbcolorcat_Intercept) 1.22 0.36 0.59 2.01 1.00 10727
sd(Blue_Intercept) 0.60 0.50 0.02 1.86 1.00 11097
Tail_ESS
sd(Nbcolorcat_Intercept) 11104
sd(Blue_Intercept) 11768
~macroarea (Number of levels: 6)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(Nbcolorcat_Intercept) 0.89 0.76 0.04 2.81 1.00 10098
sd(Blue_Intercept) 0.84 0.93 0.02 3.25 1.00 9632
Tail_ESS
sd(Nbcolorcat_Intercept) 11163
sd(Blue_Intercept) 10203
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Nbcolorcat_Intercept 6.99 0.62 5.78 8.19 1.00 11649
Blue_Intercept -6.25 1.33 -9.02 -3.78 1.00 11637
Nbcolorcat_UVB_z -0.92 0.26 -1.43 -0.42 1.00 11380
Blue_UVB_z -0.98 0.36 -1.72 -0.31 1.00 11968
Blue_Nb_color_cat 0.92 0.17 0.62 1.28 1.00 12038
Tail_ESS
Nbcolorcat_Intercept 11487
Blue_Intercept 11724
Nbcolorcat_UVB_z 11468
Blue_UVB_z 12142
Blue_Nb_color_cat 11606
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma_Nbcolorcat 2.19 0.14 1.93 2.48 1.00 11725 11052
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
With prob = 0.9:
# Causal Mediation Analysis for Stan Model
Treatment: UVB_z
Mediator: Nb_color_cat
Response: Blue
Estimate HDI (90%)
Direct effect: -0.97 [-1.55 -0.38]
Indirect effect: -0.83 [-1.31 -0.38]
Total effect: -1.80 [-2.61 -1.07]
Proportion mediated: 46.14% [25.42% 66.86%]
With prob = 0.95:
# Causal Mediation Analysis for Stan Model
Treatment: UVB_z
Mediator: Nb_color_cat
Response: Blue
Estimate HDI (95%)
Direct effect: -0.97 [-1.66 -0.26]
Indirect effect: -0.83 [-1.44 -0.33]
Total effect: -1.80 [-2.78 -0.94]
Proportion mediated: 46.14% [20.99% 71.28%]
Please note that as Blue (outcome) is binary, direct and indirect effects may be on different scales.
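The decomposition reported above follows the standard mediation identity, total = direct + indirect, with the proportion mediated equal to indirect/total. A quick check in Python against the rounded posterior means printed above (small discrepancies are due to rounding):

```python
# posterior means from the UVB_z -> Nb_color_cat -> Blue mediation output
direct = -0.97    # c'
indirect = -0.83  # a * b
total = direct + indirect         # c = c' + a*b  ->  -1.80
prop_mediated = indirect / total  # ~0.461, i.e. ~46% mediated
```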
The faint lines (yrep) are posterior predictive draws, and y is the observed distribution.
The faint lines (yrep) are posterior predictive draws.
Family: MV(gaussian, bernoulli)
Links: mu = identity; sigma = identity
mu = logit
Formula: UVB_z ~ lat_rescaled + (1 | macroarea) + (1 | family_code)
Blue ~ UVB_z + lat_rescaled + (1 | macroarea) + (1 | family_code)
Data: bNA (Number of observations: 142)
Samples: 4 chains, each with iter = 20000; warmup = 5000; thin = 5;
total post-warmup samples = 12000
Group-Level Effects:
~family_code (Number of levels: 32)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(UVBz_Intercept) 0.13 0.07 0.01 0.27 1.00 9076 9369
sd(Blue_Intercept) 1.00 0.74 0.05 2.83 1.00 8537 10554
~macroarea (Number of levels: 6)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(UVBz_Intercept) 0.14 0.13 0.01 0.45 1.00 9205 10401
sd(Blue_Intercept) 1.53 1.35 0.09 5.14 1.00 10311 10830
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
UVBz_Intercept -4.69 0.23 -5.16 -4.25 1.00 11617 11602
Blue_Intercept -6.94 4.16 -15.28 1.04 1.00 11893 11788
UVBz_lat_rescaled 5.60 0.24 5.15 6.08 1.00 11785 11690
Blue_UVB_z -2.55 0.82 -4.24 -1.03 1.00 11715 11688
Blue_lat_rescaled 8.23 4.79 -0.96 17.85 1.00 11901 11766
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma_UVBz 0.35 0.02 0.31 0.39 1.00 11987 11929
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
With prob = 0.9:
# Causal Mediation Analysis for Stan Model
Treatment: lat_rescaled
Mediator: UVB_z
Response: Blue
Estimate HDI (90%)
Direct effect: 8.15 [ 0.40 16.16]
Indirect effect: -14.07 [-21.76 -6.62]
Total effect: -5.90 [ -9.65 -2.42]
Proportion mediated: 238.42% [40.55% 436.29%]
With prob = 0.95:
# Causal Mediation Analysis for Stan Model
Treatment: lat_rescaled
Mediator: UVB_z
Response: Blue
Estimate HDI (95%)
Direct effect: 8.15 [ -1.0 17.81]
Indirect effect: -14.07 [-23.3 -5.36]
Total effect: -5.90 [-10.5 -1.93]
Proportion mediated: 238.42% [-28.88% 505.72%]
Please note that as Blue (outcome) is binary, direct and indirect effects may be on different scales.
The faint lines (yrep) are posterior predictive draws, and y is the observed distribution.
Family: MV(gaussian, bernoulli)
Links: mu = identity; sigma = identity
mu = logit
Formula: UVB_z ~ lat_fam_rescaled + (1 | macroarea)
Blue ~ UVB_z + lat_fam_rescaled + (1 | macroarea)
Data: bNA (Number of observations: 142)
Samples: 4 chains, each with iter = 20000; warmup = 5000; thin = 5;
total post-warmup samples = 12000
Group-Level Effects:
~macroarea (Number of levels: 6)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(UVBz_Intercept) 0.35 0.22 0.10 0.91 1.00 10684 11489
sd(Blue_Intercept) 1.83 1.36 0.37 5.39 1.00 9506 10149
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
UVBz_Intercept -2.78 0.42 -3.58 -1.95 1.00 11599
Blue_Intercept -2.78 2.18 -7.04 1.54 1.00 11883
UVBz_lat_fam_rescaled 3.42 0.41 2.60 4.22 1.00 11675
Blue_UVB_z -1.56 0.40 -2.38 -0.83 1.00 12187
Blue_lat_fam_rescaled 3.45 2.29 -1.03 8.13 1.00 12344
Tail_ESS
UVBz_Intercept 11319
Blue_Intercept 11788
UVBz_lat_fam_rescaled 11206
Blue_UVB_z 11842
Blue_lat_fam_rescaled 12213
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma_UVBz 0.71 0.04 0.63 0.80 1.00 11857 11534
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
With prob = 0.9:
# Causal Mediation Analysis for Stan Model
Treatment: lat_fam_rescaled
Mediator: UVB_z
Response: Blue
Estimate HDI (90%)
Direct effect: 3.42 [-0.29 7.17]
Indirect effect: -5.22 [-7.77 -2.88]
Total effect: -1.83 [-4.84 1.14]
Proportion mediated: 285.68% [-1069.29% 1640.65%]
With prob = 0.95:
# Causal Mediation Analysis for Stan Model
Treatment: lat_fam_rescaled
Mediator: UVB_z
Response: Blue
Estimate HDI (95%)
Direct effect: 3.42 [-1.19 7.90]
Indirect effect: -5.22 [-8.37 -2.53]
Total effect: -1.83 [-5.56 1.51]
Proportion mediated: 285.68% [-2377.77% 2949.13%]
Please note that as Blue (outcome) is binary, direct and indirect effects may be on different scales.
The faint lines (yrep) are posterior predictive draws, and y is the observed distribution.
The mediation analysis shows that UVR has both a direct and an indirect effect on Blue, the latter mediated by the number of color categories (with macroarea and family as random effects).
On the other hand, latitude (and latitude²) does have an effect on Blue, but this is fully mediated by UVR incidence (with macroarea and family as random effects).
We model here the effects of UVR incidence on Blue using Bayesian mixed-effects models (implemented in the brms package; Bürkner (2018b)). Because the distribution of populations is uneven with respect to geographical location and language family, we used language family and macroarea as random effects in all our models.
Family: bernoulli
Links: mu = logit
Formula: Blue ~ UVB_z + (1 | macroarea) + (1 | family_code)
Data: bNA (Number of observations: 142)
Samples: 4 chains, each with iter = 20000; warmup = 5000; thin = 5;
total post-warmup samples = 12000
Group-Level Effects:
~family_code (Number of levels: 32)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 1.14 0.83 0.06 3.14 1.00 7686 10150
~macroarea (Number of levels: 6)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 1.52 1.31 0.08 4.96 1.00 9483 10686
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -0.02 0.95 -1.89 1.97 1.00 11483 11194
UVB_z -1.33 0.37 -2.14 -0.68 1.00 10917 11068
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Highest Density Interval
Parameter | 95% HDI
--------------------------
Intercept | [-1.99, 1.85]
UVB_z | [-2.08, -0.63]
# Proportion of samples inside the ROPE [-0.01, 0.01]:
Parameter | inside ROPE
-----------------------
Intercept | 1.25 %
UVB_z | 0.00 %
Parameter ROPE_low ROPE_high p_ROPE Effects Component
1 b_Intercept -0.01 0.01 0.012 fixed conditional
2 b_UVB_z -0.01 0.01 0.000 fixed conditional
We also tried including random slopes, but they did not make any significant contribution, and we retained the random-intercepts-only model (Bayes factor 0.02, very strong evidence for random intercepts only).
UVR has a strong negative effect on Blue (Bayes factor 8963.18, extreme evidence for model with UVR incidence; 95% HDI for βUVR = [-2.08, -0.63]), explaining R2 = 35.73% of the variance, and correctly classifying 73.46% of the populations (recall = 77.57%, precision = 78.22%).
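The accuracy, recall and precision figures reported here are standard summaries of a 2×2 confusion matrix; a minimal Python sketch (illustrative, with made-up counts):

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, recall and precision from a 2x2 confusion matrix
    (tp = languages with Blue correctly predicted to have it, etc.)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)     # share of actual Blue languages recovered
    precision = tp / (tp + fp)  # share of "Blue" predictions that are right
    return accuracy, recall, precision

# made-up counts for illustration
acc, rec, prec = classification_metrics(tp=8, fp=2, fn=2, tn=8)
```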
The best model is:
Family: bernoulli
Links: mu = logit
Formula: Blue ~ UVB_z + Nb_color_cat + log_dist2lakes + hg + lat_fam_rescaled + lon_fam_rescaled + lat_rescaled + lon_rescaled + I(lat_fam_rescaled^2) + I(lon_fam_rescaled^2) + I(lat_rescaled^2) + I(Nb_color_cat^2) + (1 | macroarea) + (1 | family_code)
Data: bNA (Number of observations: 142)
Samples: 4 chains, each with iter = 20000; warmup = 5000; thin = 5;
total post-warmup samples = 12000
Group-Level Effects:
~family_code (Number of levels: 32)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 3.03 1.89 0.25 7.56 1.00 8678 9294
~macroarea (Number of levels: 6)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 3.40 2.92 0.14 10.93 1.00 10783 11528
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -22.45 22.24 -70.79 18.88 1.00 11852 11360
UVB_z -5.20 1.84 -9.10 -1.90 1.00 11887 11621
Nb_color_cat 7.78 2.41 3.58 13.07 1.00 10925 11669
log_dist2lakes -1.15 0.44 -2.10 -0.39 1.00 11022 11458
hgyes -3.42 2.62 -8.91 1.31 1.00 11000 11198
lat_fam_rescaled -59.36 70.46 -198.99 82.77 1.00 11770 11475
lon_fam_rescaled -1.80 2.74 -7.63 3.16 1.00 11616 11561
lat_rescaled 8.09 40.79 -74.76 86.81 1.00 11722 11687
lon_rescaled 2.86 2.08 -0.92 7.28 1.00 11511 11438
Ilat_fam_rescaledE2 29.51 47.64 -70.50 122.99 1.00 11720 11488
Ilon_fam_rescaledE2 2.60 4.39 -4.79 12.96 1.00 10876 11311
Ilat_rescaledE2 19.45 26.13 -29.54 73.27 1.00 11343 11114
INb_color_catE2 -0.39 0.14 -0.70 -0.14 1.00 11119 11432
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Highest Density Interval
# Fixed Effects (Conditional Model)
Parameter | 95% HDI
---------------------------------------
Intercept | [ -67.42, 21.61]
UVB_z | [ -8.93, -1.78]
Nb_color_cat | [ 3.29, 12.61]
log_dist2lakes | [ -2.02, -0.34]
hgyes | [ -8.72, 1.47]
lat_fam_rescaled | [-198.14, 83.52]
lon_fam_rescaled | [ -7.45, 3.30]
lat_rescaled | [ -71.20, 89.50]
lon_rescaled | [ -1.02, 7.18]
Ilat_fam_rescaledE2 | [ -70.56, 122.98]
Ilon_fam_rescaledE2 | [ -5.05, 12.40]
Ilat_rescaledE2 | [ -30.34, 72.22]
INb_color_catE2 | [ -0.68, -0.13]
# Random Effects (Conditional Model)
Parameter | 95% HDI
-----------------------------------
Nb_color_cat | [ 3.29, 12.61]
INb_color_catE2 | [ -0.68, -0.13]
# Proportion of samples inside the ROPE [-0.01, 0.01]:
# Fixed Effects (Conditional Model)
Parameter | inside ROPE
---------------------------------
Intercept | 0.03 %
UVB_z | 0.00 %
Nb_color_cat | 0.00 %
log_dist2lakes | 0.00 %
hgyes | 0.13 %
lat_fam_rescaled | 0.01 %
lon_fam_rescaled | 0.43 %
lat_rescaled | 0.00 %
lon_rescaled | 0.16 %
Ilat_fam_rescaledE2 | 0.00 %
Ilon_fam_rescaledE2 | 0.27 %
Ilat_rescaledE2 | 0.03 %
INb_color_catE2 | 0.00 %
# Random Effects (Conditional Model)
Parameter | inside ROPE
-----------------------------
Nb_color_cat | 0.00 %
INb_color_catE2 | 0.00 %
Parameter ROPE_low ROPE_high p_ROPE Effects Component
7 b_Intercept -0.01 0.01 2.5e-04 fixed conditional
15 b_UVB_z -0.01 0.01 0.0e+00 fixed conditional
13 b_Nb_color_cat -0.01 0.01 0.0e+00 random conditional
14 b_Nb_color_cat -0.01 0.01 0.0e+00 fixed conditional
10 b_log_dist2lakes -0.01 0.01 3.3e-04 fixed conditional
1 b_hgyes -0.01 0.01 1.3e-03 fixed conditional
8 b_lat_fam_rescaled -0.01 0.01 8.3e-05 fixed conditional
11 b_lon_fam_rescaled -0.01 0.01 4.1e-03 fixed conditional
9 b_lat_rescaled -0.01 0.01 0.0e+00 fixed conditional
12 b_lon_rescaled -0.01 0.01 1.5e-03 fixed conditional
2 b_Ilat_fam_rescaledE2 -0.01 0.01 0.0e+00 fixed conditional
4 b_Ilon_fam_rescaledE2 -0.01 0.01 2.6e-03 fixed conditional
3 b_Ilat_rescaledE2 -0.01 0.01 2.5e-04 fixed conditional
5 b_INb_color_catE2 -0.01 0.01 2.5e-04 fixed conditional
6 b_INb_color_catE2 -0.01 0.01 2.5e-04 random conditional
We used the variance inflation factor (VIF) with a threshold of 5, to detect multicollinearity between the covariates in this model.
| UVB_z | Nb_color_cat | log_dist2lakes | hg | lat_fam_rescaled | lon_fam_rescaled | lat_rescaled | lon_rescaled |
|---|---|---|---|---|---|---|---|
| 13.73 | 1.285 | 1.144 | 1.455 | 7.675 | 5.105 | 26.9 | 5.081 |
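The VIF of each covariate is 1/(1 - R²), where R² comes from regressing that covariate on all the others; a minimal Python sketch (illustrative, not the R code used here, on simulated data):

```python
import numpy as np

def vif(X):
    """VIF of each column of X: regress it on the remaining columns
    (plus an intercept) and return 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out.append(1 / (1 - r2))
    return np.array(out)

# simulated example: x3 is almost a copy of x1, so both get inflated VIFs
rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=200), rng.normal(size=200)
x3 = x1 + 0.01 * rng.normal(size=200)
v = vif(np.column_stack([x1, x2, x3]))  # v[0], v[2] well above 5; v[1] near 1
```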
UVR and latitude have the highest VIF scores in our dataset, but, according to our mediation analysis, latitude has an effect on Blue only through UVR. We compared (using both Bayes factors and K-fold) the following four models:
Bayes factor: model_nolat vs model_nouvb: 0.012; very strong evidence for model_nouvb.
Bayes factor: model_nolat vs model_nouvb_lin: 13.909; strong evidence for model_nolat.
Bayes factor: model_nolat vs model_nouvb_quadr: 6.848; moderate evidence for model_nolat.
elpd_diff se_diff
model_nolat 0.0 0.0
model_nouvb -2.0 3.7
elpd_diff se_diff
model_nolat 0.0 0.0
model_nouvb_lin -1.2 3.1
elpd_diff se_diff
model_nouvb_quadr 0.0 0.0
model_nolat -1.6 2.6
With Bayes factors: (latitude + latitude²) > UVR > [latitude² & latitude taken separately]
With LOO/WAIC/KFOLD: all models seem comparable (the difference between models is smaller than the SE).
Let’s remove UVR from our best complex model:
Bayes factor: model_nolat vs model_nouvb: 650.848; extreme evidence for best model.
… and let’s remove latitude from our best model (quadratic and linear):
Bayes factor: model_nolat vs model_nouvb: 594687.942; extreme evidence for best model.
Thus, we should keep UVR and latitude (and latitude²) as predictors of Blue.
This complex model includes many variables. In order to check which variables are the most important, we computed the best model according to LOO, WAIC and KFOLD methods.
With this, the best model is:
Family: bernoulli
Links: mu = logit
Formula: Blue ~ Nb_color_cat + dist2lakes + I(Nb_color_cat^2) + (1 | macroarea) + (1 | family_code)
Data: bNA (Number of observations: 142)
Samples: 4 chains, each with iter = 20000; warmup = 5000; thin = 5;
total post-warmup samples = 12000
Group-Level Effects:
~family_code (Number of levels: 32)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 1.90 1.05 0.21 4.27 1.00 8727 9119
~macroarea (Number of levels: 6)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 1.24 1.27 0.04 4.64 1.00 9285 9887
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -22.48 6.61 -36.60 -11.04 1.00 11120 11643
Nb_color_cat 5.28 1.68 2.33 8.85 1.00 10784 11768
dist2lakes -0.01 0.01 -0.03 0.00 1.00 11870 11763
INb_color_catE2 -0.27 0.10 -0.48 -0.09 1.00 10647 11722
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Highest Density Interval
# Fixed Effects (Conditional Model)
Parameter | 95% HDI
----------------------------------
Intercept | [-35.65, -10.37]
Nb_color_cat | [ 2.20, 8.64]
dist2lakes | [ -0.03, 0.00]
INb_color_catE2 | [ -0.47, -0.09]
# Random Effects (Conditional Model)
Parameter | 95% HDI
----------------------------------
Nb_color_cat | [ 2.20, 8.64]
INb_color_catE2 | [ -0.47, -0.09]
# Proportion of samples inside the ROPE [-0.01, 0.01]:
# Fixed Effects (Conditional Model)
Parameter | inside ROPE
-----------------------------
Intercept | 0.00 %
Nb_color_cat | 0.00 %
dist2lakes | 28.88 %
INb_color_catE2 | 0.00 %
# Random Effects (Conditional Model)
Parameter | inside ROPE
-----------------------------
Nb_color_cat | 0.00 %
INb_color_catE2 | 0.00 %
Parameter ROPE_low ROPE_high p_ROPE Effects Component
4 b_Intercept -0.01 0.01 0.00000 fixed conditional
5 b_Nb_color_cat -0.01 0.01 0.00000 random conditional
6 b_Nb_color_cat -0.01 0.01 0.00000 fixed conditional
1 b_dist2lakes -0.01 0.01 0.29525 fixed conditional
2 b_INb_color_catE2 -0.01 0.01 0.00083 fixed conditional
3 b_INb_color_catE2 -0.01 0.01 0.00083 random conditional
We use the same approach as before, but this time we remove the number of color categories from the set of potential predictors, as its inclusion is debatable.
With these, the best fit (according to Bayes factors) is very similar to the best fit obtained before:
Family: bernoulli
Links: mu = logit
Formula: Blue ~ UVB_z + log_popSize + log_dist2lakes + hg + lat_fam_rescaled + lon_fam_rescaled + lat_rescaled + lon_rescaled + I(lat_fam_rescaled^2) + I(lon_fam_rescaled^2) + I(lat_rescaled^2) + (1 | macroarea) + (1 | family_code)
Data: bNA (Number of observations: 142)
Samples: 4 chains, each with iter = 20000; warmup = 5000; thin = 5;
total post-warmup samples = 12000
Group-Level Effects:
~family_code (Number of levels: 32)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 2.85 1.84 0.34 7.45 1.00 7439 8606
~macroarea (Number of levels: 6)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 5.22 3.35 0.70 13.73 1.00 9957 9771
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -8.06 18.85 -49.26 26.93 1.00 11143 10471
UVB_z -4.16 1.47 -7.35 -1.56 1.00 11761 11574
log_popSize 0.54 0.16 0.28 0.88 1.00 10204 11844
log_dist2lakes -0.84 0.27 -1.43 -0.35 1.00 11447 11488
hgyes -2.48 1.84 -6.24 1.07 1.00 11991 11452
lat_fam_rescaled -33.88 61.82 -152.83 95.30 1.00 11035 10451
lon_fam_rescaled -1.19 2.23 -5.70 3.03 1.00 11359 11000
lat_rescaled 24.42 35.03 -44.00 93.91 1.00 11121 11587
lon_rescaled 0.97 1.73 -2.23 4.58 1.00 11369 11683
Ilat_fam_rescaledE2 24.38 42.11 -61.80 105.75 1.00 10991 10648
Ilon_fam_rescaledE2 2.26 3.98 -4.07 11.56 1.00 11095 10778
Ilat_rescaledE2 -6.91 21.73 -49.20 36.39 1.00 11113 11597
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Highest Density Interval
Parameter | 95% HDI
---------------------------------------
Intercept | [ -46.07, 29.33]
UVB_z | [ -7.06, -1.42]
log_popSize | [ 0.26, 0.85]
log_dist2lakes | [ -1.41, -0.34]
hgyes | [ -6.07, 1.21]
lat_fam_rescaled | [-154.20, 92.51]
lon_fam_rescaled | [ -5.69, 3.04]
lat_rescaled | [ -44.00, 93.92]
lon_rescaled | [ -2.24, 4.56]
Ilat_fam_rescaledE2 | [ -58.99, 107.65]
Ilon_fam_rescaledE2 | [ -4.98, 10.44]
Ilat_rescaledE2 | [ -48.32, 37.22]
# Proportion of samples inside the ROPE [-0.01, 0.01]:
Parameter | inside ROPE
---------------------------------
Intercept | 0.05 %
UVB_z | 0.00 %
log_popSize | 0.00 %
log_dist2lakes | 0.00 %
hgyes | 0.19 %
lat_fam_rescaled | 0.01 %
lon_fam_rescaled | 0.38 %
lat_rescaled | 0.04 %
lon_rescaled | 0.44 %
Ilat_fam_rescaledE2 | 0.03 %
Ilon_fam_rescaledE2 | 0.21 %
Ilat_rescaledE2 | 0.03 %
Parameter ROPE_low ROPE_high p_ROPE Effects Component
5 b_Intercept -0.01 0.01 5.0e-04 fixed conditional
12 b_UVB_z -0.01 0.01 0.0e+00 fixed conditional
9 b_log_popSize -0.01 0.01 0.0e+00 fixed conditional
8 b_log_dist2lakes -0.01 0.01 1.7e-04 fixed conditional
1 b_hgyes -0.01 0.01 1.8e-03 fixed conditional
6 b_lat_fam_rescaled -0.01 0.01 8.3e-05 fixed conditional
10 b_lon_fam_rescaled -0.01 0.01 3.6e-03 fixed conditional
7 b_lat_rescaled -0.01 0.01 4.2e-04 fixed conditional
11 b_lon_rescaled -0.01 0.01 4.2e-03 fixed conditional
2 b_Ilat_fam_rescaledE2 -0.01 0.01 2.5e-04 fixed conditional
4 b_Ilon_fam_rescaledE2 -0.01 0.01 2.0e-03 fixed conditional
3 b_Ilat_rescaledE2 -0.01 0.01 2.5e-04 fixed conditional
Checking the importance of UVR:
Bayes factor: bestmodel2 vs bf_bestmodel2_nouvb: 1265.286; extreme evidence for keeping UVR.
With this, the best model is:
Family: bernoulli
Links: mu = logit
Formula: Blue ~ UVB_z + log_popSize + log_dist2lakes + (1 | macroarea) + (1 | family_code)
Data: bNA (Number of observations: 142)
Samples: 4 chains, each with iter = 20000; warmup = 5000; thin = 5;
total post-warmup samples = 12000
Group-Level Effects:
~family_code (Number of levels: 32)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 1.34 0.95 0.07 3.62 1.00 7624 10629
~macroarea (Number of levels: 6)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 2.94 2.31 0.18 8.93 1.00 8015 8999
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -2.67 2.08 -6.86 1.46 1.00 10972 10700
UVB_z -1.73 0.51 -2.88 -0.87 1.00 9838 11138
log_popSize 0.42 0.11 0.23 0.67 1.00 10510 11005
log_dist2lakes -0.52 0.21 -0.97 -0.14 1.00 9695 11313
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Highest Density Interval
Parameter | 95% HDI
-------------------------------
Intercept | [-6.68, 1.61]
UVB_z | [-2.79, -0.82]
log_popSize | [ 0.22, 0.65]
log_dist2lakes | [-0.95, -0.12]
# Proportion of samples inside the ROPE [-0.01, 0.01]:
Parameter | inside ROPE
----------------------------
Intercept | 0.14 %
UVB_z | 0.00 %
log_popSize | 0.00 %
log_dist2lakes | 0.00 %
Parameter ROPE_low ROPE_high p_ROPE Effects Component
1 b_Intercept -0.01 0.01 0.0013 fixed conditional
4 b_UVB_z -0.01 0.01 0.0000 fixed conditional
3 b_log_popSize -0.01 0.01 0.0000 fixed conditional
2 b_log_dist2lakes -0.01 0.01 0.0011 fixed conditional
Checking the importance of UVR:
Bayes factor: bestmodel2_kfold vs bestmodel2_kfold_nouvb: 19744.72; extreme evidence for keeping UVR.
elpd_diff se_diff
bestmodel2_kfold 0.0 0.0
bestmodel2_kfold_nouvb -3.0 6.1
Number of color categories. As predicted, this has a strong positive effect on Blue; thus, the more color categories in a language, the more likely it is to have a word for blue.
Population Size. This becomes predictive only if we do not consider the number of color categories (in fact, these two variables are highly positively correlated); thus, the larger the population, the more likely it is to have a word for blue.
Figure 1. Population size vs having a word for Blue.
UVR. As suggested by Lindsey and Brown, UVR incidence is an important predictor for Blue; the higher the UVR incidence, the less likely it is to have a word for blue.
Figure 52. UVR incidence vs having a word for Blue.
Figure 53. Map of UVR incidence on having a word for Blue.
Latitude. Latitude is highly correlated with UVR incidence, and its effects on Blue are fully mediated by it; the closer to the equator a language is, the less likely it is to have a specific word for blue.
Figure 2. Latitude on having a word for Blue.
Distance to lakes. It is unclear why the distance to lakes is a predictor for the presence of words for Blue, but it very probably is a proxy for other (unmeasured) variables (see decision trees below); the more distant from lakes a language is, the less likely it is to have a word for blue.
Figure 55. Map of log(dist2lakes) on having a word for Blue.
Figure 3. log(dist2lakes) on having a word for Blue.
Latitude and longitude of putative language family origins. Depending on the model selection method used, these are included or not in the set of relevant predictors.
Figure 57. Map of latitude of language family origins on having a word for Blue.
Figure 58. Map of longitude of language family origins on having a word for Blue.
Figure 4. Location of putative origins of language family on having a word for Blue.
HG. The effect of subsistence strategies is one of the weakest predictors, with hunter-gatherers (HG) being less likely to possess a specific term for blue.
Figure 60. Map of subsistence strategy on having a word for Blue.
| Blue:no | Blue:yes | |
|---|---|---|
| HG:no | 50 | 78 |
| HG:yes | 10 | 4 |
We used random forests to estimate the relative importance of our variables in predicting the existence of words for Blue.
Random forests are supervised machine-learning algorithms which can be used for both classification and regression. The algorithm constructs a multitude of decision trees on the training sample and aggregates the prediction of each tree (bagging); the final prediction is then selected by majority voting. Compared to classic decision-tree methods, the structure of random forests avoids overfitting the training data.
The relative importance of our variables can be measured with two indices: Gini impurity and an accuracy-based method. While Gini impurity measures the likelihood of incorrect classification of randomly classified new instances, accuracy-based methods use the mean decrease in classification accuracy after permuting each variable over all trees. Because the choice of the training/testing split matters, we ran the algorithm several times with random allocations of the training and testing sets, and computed the mean variable importance.
However, it should be noted that random forests have trouble modelling the shared variance due to hierarchical structure, here in particular due to language family and macroarea.
See https://www.displayr.com/how-is-variable-importance-calculated-for-a-random-forest/ for details:
Gini-based importance: When a tree is built, the decision about which variable to split on at each node uses a calculation of the Gini impurity. For each variable, the Gini decrease is summed across every tree of the forest, every time that variable is chosen to split a node. The sum is divided by the number of trees in the forest to give an average. The scale is irrelevant: only the relative values matter. There is a bias towards using numeric variables to split nodes because there are potentially many split points.
Accuracy-based importance: Each tree has its testing sample of data that was not used during construction. This sample is used to calculate importance of a specific variable. First, the prediction accuracy on the testing sample is measured. Then, the values of the variable in the testing sample are randomly shuffled, keeping all other variables the same. Finally, the decrease in prediction accuracy on the shuffled data is measured. The mean decrease in accuracy across all trees is reported. This importance measure is also broken down by outcome class. Intuitively, the random shuffling means that, on average, the shuffled variable has no predictive power. This importance is a measure of by how much removing a variable decreases accuracy, and vice versa — by how much including a variable increases accuracy. Note that if a variable has very little predictive power, shuffling may lead to a slight increase in accuracy due to random noise. This in turn can give rise to small negative importance scores, which can be essentially regarded as equivalent to zero importance.
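The two importance measures described above can be sketched as follows. This is an illustrative Python example on toy data (the analysis itself used R's randomForest and cforest packages); all dataset and variable names here are invented for the sketch.

```python
# Sketch of Gini-based vs accuracy-based (permutation) variable importance,
# using scikit-learn as a stand-in for R's randomForest; toy data only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Gini-based importance: mean decrease in impurity, averaged over all trees.
gini_imp = rf.feature_importances_

# Accuracy-based importance: mean drop in held-out accuracy after shuffling
# each feature in turn (small negative values are equivalent to zero importance).
perm = permutation_importance(rf, X_test, y_test, n_repeats=30, random_state=0)
acc_imp = perm.importances_mean
```

Only the relative values of `gini_imp` matter (they are normalized to sum to 1 in scikit-learn), whereas `acc_imp` is on the accuracy scale of the held-out sample.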
The randomForest algorithm correctly classifies 82.82% of the populations (recall = 86.51%, precision = 84.18%).
Figure 61. Variable importance according to randomForest algorithm using the Gini index.
Figure 62. Variable importance according to randomForest algorithm using the accuracy index.
The cForest algorithm correctly classifies 81.55% of the populations (recall = 85.61%, precision = 82.99%).
Figure 63. Variable importance according to cforest algorithm using the accuracy index.
Here, we find again the number of color categories as the most important predictor, followed by UVR incidence, daltonism, population size, latitude and distance to lakes. Thus, we found results relatively similar to those from the regression analysis, but also differences: daltonism was not predictive in the regression models, population size was predictive only when excluding the number of color categories, and the latitude and longitude of language family origins do not appear to be important variables here.
The randomForest algorithm correctly classifies 76.04% of the populations (recall = 77.68%, precision = 80.21%).
Figure 64. Variable importance according to randomForest algorithm using the Gini index.
Figure 65. Variable importance according to randomForest algorithm using the accuracy index.
The cForest algorithm correctly classifies 79.89% of the populations (recall = 82.78%, precision = 82.46%).
Figure 66. Variable importance according to cForest algorithm using the accuracy index.
When the number of color categories is excluded, we find that population size, UVR and daltonism are the most important predictors, followed by latitude and distance to lakes.
While the results from random forests and regression are congruent, especially concerning the importance of UVR and number of color categories/population size, we also found some differences:
We used simple decision trees as implemented by ctree (Hothorn, Hornik, & Zeileis, 2006).
Figure 67. Simple decision tree.
To understand what this decision tree does, we plotted which populations have less/more than 0.726 for z-scored UVR, and less/more than 4.7 for log_dist2lakes.
Figure 68. UVR (z-scored) as split by the decision tree.
Figure 69. Log of distance to lakes as split by the decision tree.
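The thresholds visualized in Figures 68 and 69 are read directly off the fitted tree's internal nodes. The following is a minimal Python sketch of that idea, using scikit-learn's DecisionTreeClassifier as a stand-in for R's ctree, on toy data with an invented split value:

```python
# Sketch: recovering a split threshold from a shallow decision tree.
# Toy data; the real analysis used ctree on UVR and log_dist2lakes.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))        # columns stand in for UVR_z and log_dist2lakes
y = (X[:, 0] > 0.7).astype(int)      # outcome driven entirely by the first variable

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each internal node stores the feature index and threshold it splits on;
# this is what the "less/more than" plots above display for the fitted ctree.
root_feature = tree.tree_.feature[0]
root_threshold = tree.tree_.threshold[0]
```

On this toy data the root node recovers the generating variable (feature 0) and a threshold near 0.7, mirroring how the 0.726 and 4.7 cut-points were obtained.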
Thus, it seems that the distance to lakes is relevant only when there are exactly 6 color categories, in which case populations living more than about 25 km (= 24.7) away from the closest lake do not have a specific word for Blue, whereas populations living closer have a 75% chance of possessing such a word. While we could not find a satisfying explanation for this “lake effect”, we suggest some hypotheses (among many others) below:
To find out if there are other relations between the variables in our dataset and Blue that could have been missed by the regression, random forest and decision tree models, we compared their performance with that of Support Vector Machines (SVM; as implemented by the e1071 package).
SVM is a supervised machine learning technique that can be used for both classification and regression. It uses the kernel trick to find a hyperplane in an N-dimensional space (N being the number of features) that distinctly classifies the data points.
We randomly split the dataset into a training and a testing set, and computed the precision, recall and classification rate from the model’s confusion matrix. The training set contains 80% of the dataset, and we performed stratified sampling on the variable Blue. Because our dataset is quite small, the choice of the training and testing sets impacted the results. To avoid this problem, we ran the algorithms 30 times and computed the mean classification rate, recall and precision.
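This evaluation loop (30 random stratified 80/20 splits, averaging the three metrics) can be sketched as below. The sketch is in Python on toy data (the analysis itself used R's e1071); the dataset is simulated and all names are illustrative.

```python
# Sketch of the repeated-split evaluation described above: 30 stratified
# 80/20 splits, with accuracy, precision and recall averaged over runs.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=142, n_features=6, random_state=0)

scores = []
for seed in range(30):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.8, stratify=y, random_state=seed)  # stratified on the outcome
    pred = SVC(kernel="rbf").fit(X_tr, y_tr).predict(X_te)
    scores.append((accuracy_score(y_te, pred),
                   precision_score(y_te, pred),
                   recall_score(y_te, pred)))

mean_acc, mean_prec, mean_rec = np.mean(scores, axis=0)
```

Averaging over many random splits is what makes the comparison between brms, random forests and SVM stable despite the small sample.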
Model with all covariates:
Figure 70. Mean performance of classification with all covariates for brms, randomForest and SVM algorithms.
Model with only Number of color categories (linear and quadratic) and distance to lakes as covariates:
Figure 71. Mean performance of classification for brms, randomForest and SVM algorithms for a model including only number of color categories (linear and quadratic) and distance to lakes.
The performance is very similar for all algorithms. The classification accuracy on the testing set is high for the bayesian mixed-effect model, the random forest and the SVM. Thus, a simple bayesian mixed-effect model is enough to capture the explanatory power of our variables for predicting the presence of a specific term for blue.
See Plotting the predictors for data visualization, but the main conclusions are:
Then, we investigated the relation between UVR and daltonism. As a preliminary analysis, we conducted a mediation analysis (using the mediation package in R; Tingley, Yamamoto, Hirose, Keele, & Imai, 2014) to test whether UVR is a mediator between latitude and daltonism.
Second, we carried out a mediation analysis (using the brm function from brms package in R: Bürkner, 2018a and the mediation function from spatstat; Baddeley et al., 2015) to test whether UVR is a mediator between latitude and daltonism, and between latitude of language family and daltonism.
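The logic of the mediation analysis (direct vs indirect effects) can be illustrated with the classic product-of-coefficients form: regress the mediator on the treatment (M ~ X), then the outcome on both (Y ~ X + M). This Python sketch uses simulated data and plain OLS, ignoring the random effects and Beta likelihood of the actual brms models; all values and names are illustrative.

```python
# Minimal mediation sketch: latitude -> UVR -> daltonism (toy data).
import numpy as np

rng = np.random.default_rng(1)
n = 142
latitude = rng.normal(size=n)                                   # treatment
uvr = -0.8 * latitude + rng.normal(scale=0.3, size=n)           # mediator model: M ~ X
dalton = 0.5 * uvr + rng.normal(scale=0.3, size=n)              # outcome: no direct path

def ols(X, y):
    """Least-squares coefficients for design matrix X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
a = ols(np.column_stack([ones, latitude]), uvr)[1]              # X -> M path
c_prime, b = ols(np.column_stack([ones, latitude, uvr]), dalton)[1:3]  # X and M -> Y

indirect = a * b        # effect of latitude routed through UVR
direct = c_prime        # remaining direct effect of latitude
total = direct + indirect
```

Here the data are generated with no direct path, so the estimated direct effect is near zero and the total effect is carried by the indirect path, which is the "fully mediated" pattern reported below for latitude.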
Family: MV(gaussian, beta)
Links: mu = identity; sigma = identity
mu = logit; phi = identity
Formula: UVB_z ~ lat_rescaled + (1 | macroarea) + (1 | family_code)
daltonism_beta ~ UVB_z + lat_rescaled + (1 | macroarea) + (1 | family_code)
Data: bNA (Number of observations: 142)
Samples: 4 chains, each with iter = 20000; warmup = 5000; thin = 5;
total post-warmup samples = 12000
Group-Level Effects:
~family_code (Number of levels: 32)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(UVBz_Intercept) 0.13 0.07 0.01 0.27 1.00 8957
sd(daltonismbeta_Intercept) 0.30 0.20 0.02 0.76 1.00 7143
Tail_ESS
sd(UVBz_Intercept) 9499
sd(daltonismbeta_Intercept) 8804
~macroarea (Number of levels: 6)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(UVBz_Intercept) 0.14 0.13 0.01 0.45 1.00 9123
sd(daltonismbeta_Intercept) 0.64 0.40 0.15 1.65 1.00 9540
Tail_ESS
sd(UVBz_Intercept) 10285
sd(daltonismbeta_Intercept) 9637
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
UVBz_Intercept -4.69 0.23 -5.17 -4.25 1.00 10662
daltonismbeta_Intercept -3.73 1.04 -5.75 -1.68 1.00 11156
UVBz_lat_rescaled 5.60 0.24 5.14 6.09 1.00 10858
daltonismbeta_UVB_z -0.30 0.18 -0.64 0.07 1.00 11555
daltonismbeta_lat_rescaled -0.01 1.15 -2.29 2.19 1.00 11170
Tail_ESS
UVBz_Intercept 11506
daltonismbeta_Intercept 11765
UVBz_lat_rescaled 11448
daltonismbeta_UVB_z 11648
daltonismbeta_lat_rescaled 11845
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma_UVBz 0.35 0.02 0.31 0.39 1.00 11376 11572
phi_daltonismbeta 52.21 7.23 39.19 67.41 1.00 10945 11788
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
With prob = 0.90:
# Causal Mediation Analysis for Stan Model
Treatment: lat_rescaled
Mediator: UVB_z
Response: daltonismbeta
Estimate HDI (90%)
Direct effect: 0.00 [-1.99 1.78]
Indirect effect: -1.66 [-3.34 0.00]
Total effect: -1.65 [-2.47 -0.90]
Proportion mediated: 100.41% [-26.19% 227.02%]
The faint lines (yrep) are posterior predictive draws, and y is the observed distribution.
Family: MV(gaussian, beta)
Links: mu = identity; sigma = identity
mu = logit; phi = identity
Formula: UVB_z ~ lat_fam_rescaled + (1 | macroarea)
daltonism_beta ~ UVB_z + lat_fam_rescaled + (1 | macroarea)
Data: bNA (Number of observations: 142)
Samples: 4 chains, each with iter = 20000; warmup = 5000; thin = 5;
total post-warmup samples = 12000
Group-Level Effects:
~macroarea (Number of levels: 6)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(UVBz_Intercept) 0.35 0.22 0.10 0.88 1.00 10623
sd(daltonismbeta_Intercept) 0.75 0.42 0.32 1.81 1.00 10502
Tail_ESS
sd(UVBz_Intercept) 11096
sd(daltonismbeta_Intercept) 11353
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat
UVBz_Intercept -2.78 0.42 -3.58 -1.95 1.00
daltonismbeta_Intercept -4.56 0.54 -5.64 -3.54 1.00
UVBz_lat_fam_rescaled 3.42 0.41 2.61 4.23 1.00
daltonismbeta_UVB_z -0.37 0.08 -0.52 -0.21 1.00
daltonismbeta_lat_fam_rescaled 0.93 0.45 0.06 1.81 1.00
Bulk_ESS Tail_ESS
UVBz_Intercept 11344 11220
daltonismbeta_Intercept 10700 11556
UVBz_lat_fam_rescaled 11362 11198
daltonismbeta_UVB_z 12049 11623
daltonismbeta_lat_fam_rescaled 11422 11973
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma_UVBz 0.71 0.04 0.63 0.80 1.00 11300 11976
phi_daltonismbeta 50.48 6.56 38.24 63.92 1.00 11904 11806
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
With prob = 0.9:
# Causal Mediation Analysis for Stan Model
Treatment: lat_fam_rescaled
Mediator: UVB_z
Response: daltonismbeta
Estimate HDI (90%)
Direct effect: 0.92 [ 0.20 1.68]
Indirect effect: -1.24 [-1.75 -0.76]
Total effect: -0.34 [-0.97 0.32]
Proportion mediated: 366.93% [-1249.26% 1983.11%]
With prob = 0.95:
# Causal Mediation Analysis for Stan Model
Treatment: lat_fam_rescaled
Mediator: UVB_z
Response: daltonismbeta
Estimate HDI (95%)
Direct effect: 0.92 [ 0.04 1.80]
Indirect effect: -1.24 [-1.88 -0.69]
Total effect: -0.34 [-1.10 0.44]
Proportion mediated: 366.93% [-2782.92% 3516.78%]
The faint lines (yrep) are posterior predictive draws, and y is the observed distribution.
The mediation analysis indicates that the effect of latitude (and latitude²) on the population frequency of abnormal color perception is fully mediated by UVR incidence (when macroarea and family are random effects).
We investigate here the effects of UVR incidence on daltonism.
We use the same methods as in the previous part, namely Bayesian mixed-effects models (implemented in the package brms; Bürkner (2018b)) with language family and macroarea as random effects. As daltonism is a percentage, we used Beta regression.
Family: beta
Links: mu = logit; phi = identity
Formula: daltonism_beta ~ UVB_z + (1 | macroarea) + (1 | family_code)
Data: bNA (Number of observations: 142)
Samples: 4 chains, each with iter = 3000; warmup = 500; thin = 1;
total post-warmup samples = 10000
Group-Level Effects:
~family_code (Number of levels: 32)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.29 0.19 0.02 0.75 1.00 1336 2436
~macroarea (Number of levels: 6)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.63 0.36 0.15 1.57 1.00 2337 1977
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -3.74 0.32 -4.43 -3.11 1.00 3591 3281
UVB_z -0.29 0.07 -0.45 -0.15 1.00 5088 6282
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
phi 52.41 7.13 39.40 67.53 1.00 5846 6664
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Highest Density Interval
Parameter | 95% HDI
--------------------------
Intercept | [-4.38, -3.07]
UVB_z | [-0.44, -0.15]
# Proportion of samples inside the ROPE [-0.01, 0.01]:
Parameter | inside ROPE
-----------------------
Intercept | 0.00 %
UVB_z | 0.00 %
Parameter ROPE_low ROPE_high p_ROPE Effects Component
1 b_Intercept -0.01 0.01 0 fixed conditional
2 b_UVB_z -0.01 0.01 0 fixed conditional
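The HDI and ROPE figures above are computed from the raw posterior draws. A minimal Python sketch of both quantities, on simulated draws roughly matching the UVB_z posterior (illustrative only; the report used R tooling for this):

```python
# Sketch: highest density interval and proportion of draws inside a ROPE.
import numpy as np

def hdi(samples, prob=0.95):
    """Narrowest interval containing `prob` of the samples."""
    s = np.sort(samples)
    n_in = int(np.ceil(prob * len(s)))
    widths = s[n_in - 1:] - s[:len(s) - n_in + 1]   # widths of all candidate intervals
    i = int(np.argmin(widths))
    return s[i], s[i + n_in - 1]

def prop_in_rope(samples, low=-0.01, high=0.01):
    """Share of posterior mass inside the region of practical equivalence."""
    return float(np.mean((samples >= low) & (samples <= high)))

# Toy posterior draws shaped like the UVB_z estimate (-0.29, sd 0.07)
draws = np.random.default_rng(0).normal(-0.29, 0.07, size=10_000)
lo, hi = hdi(draws)
rope = prop_in_rope(draws)
```

With a posterior centred at −0.29 and the ROPE at [−0.01, 0.01], essentially no mass falls inside the ROPE, matching the 0.00% reported for UVB_z above.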
As before, including random slopes did not make any significant contribution (Bayes factor 0.02, very strong evidence for random intercepts only). Consequently, we retained the random intercepts-only model.
The effect of UVR on daltonism is strong and negative:
The model with UVR explains R2 = 49.06% of the variance. The Root Mean Square Error of the prediction (RMSE) is 2.18.
We added covariates and selected the best model with Bayes factors on one hand, and with WAIC, LOO and KFOLD on the other hand.
The best model is:
Family: beta
Links: mu = logit; phi = identity
Formula: daltonism_beta ~ Blue + lat_rescaled + lat_fam_rescaled + I(lat_rescaled^2) + I(lat_fam_rescaled^2) + (1 | macroarea) + (1 | family_code)
Data: bNA (Number of observations: 142)
Samples: 4 chains, each with iter = 20000; warmup = 5000; thin = 5;
total post-warmup samples = 12000
Group-Level Effects:
~family_code (Number of levels: 32)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.20 0.17 0.01 0.61 1.00 8208 9223
~macroarea (Number of levels: 6)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.55 0.35 0.13 1.44 1.00 10524 9872
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -3.98 1.59 -7.24 -0.98 1.00 11717 10727
Blueyes 0.47 0.15 0.18 0.77 1.00 12229 11765
lat_rescaled 0.13 3.94 -7.54 8.01 1.00 11822 11687
lat_fam_rescaled 1.07 4.73 -7.78 11.20 1.00 11000 10269
Ilat_rescaledE2 -1.30 2.68 -6.61 3.92 1.00 11862 11687
Ilat_fam_rescaledE2 -0.08 3.37 -7.22 6.25 1.00 11027 10394
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
phi 54.74 7.37 41.59 70.11 1.00 11670 11529
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Highest Density Interval
Parameter | 95% HDI
------------------------------------
Intercept | [-7.28, -1.03]
Blueyes | [ 0.19, 0.77]
lat_rescaled | [-7.43, 8.04]
lat_fam_rescaled | [-8.22, 10.60]
Ilat_rescaledE2 | [-6.62, 3.89]
Ilat_fam_rescaledE2 | [-6.85, 6.57]
# Proportion of samples inside the ROPE [-0.01, 0.01]:
Parameter | inside ROPE
---------------------------------
Intercept | 0.00 %
Blueyes | 0.00 %
lat_rescaled | 0.18 %
lat_fam_rescaled | 0.19 %
Ilat_rescaledE2 | 0.34 %
Ilat_fam_rescaledE2 | 0.24 %
Parameter ROPE_low ROPE_high p_ROPE Effects Component
4 b_Intercept -0.01 0.01 0.00025 fixed conditional
1 b_Blueyes -0.01 0.01 0.00050 fixed conditional
6 b_lat_rescaled -0.01 0.01 0.00175 fixed conditional
5 b_lat_fam_rescaled -0.01 0.01 0.00183 fixed conditional
3 b_Ilat_rescaledE2 -0.01 0.01 0.00325 fixed conditional
2 b_Ilat_fam_rescaledE2 -0.01 0.01 0.00225 fixed conditional
The best model using Bayes factor explains R2 = 51.3% of the variance. The Root Mean Square Error of the prediction (RMSE) is 2.15.
Is HG predictive?
Bayes factor: best model with HG vs best model without HG: 2.165; anecdotal evidence for the best model with HG.
elpd_diff se_diff
d_bestmodel_1 0.0 0.0
d_bestmodel -0.7 2.8
There is anecdotal evidence for the model with HG. The effect is negative but not significant: hunter-gatherers are more likely not to have a word for blue.
According to our mediation analysis, latitude (and latitude²) have an effect on daltonism only through the mediator UVR.
This complex model includes many variables. In order to check which variables are the most important, we computed the best model according to LOO, WAIC and KFOLD methods.
With these methods, the best model is:
Family: beta
Links: mu = logit; phi = identity
Formula: daltonism_beta ~ Blue + UVB_z + I(lat_fam_rescaled^2) + (1 | macroarea) + (1 | family_code)
Data: bNA (Number of observations: 142)
Samples: 4 chains, each with iter = 20000; warmup = 5000; thin = 5;
total post-warmup samples = 12000
Group-Level Effects:
~family_code (Number of levels: 32)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.15 0.13 0.01 0.51 1.00 10114 9435
~macroarea (Number of levels: 6)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.61 0.36 0.20 1.51 1.00 10548 10692
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -4.40 0.44 -5.28 -3.53 1.00 11607 11225
Blueyes 0.47 0.15 0.18 0.75 1.00 11636 11009
UVB_z -0.27 0.08 -0.43 -0.11 1.00 12113 11646
Ilat_fam_rescaledE2 0.52 0.38 -0.24 1.27 1.00 12210 11608
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
phi 55.10 7.27 41.77 70.04 1.00 11617 10582
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Highest Density Interval
Parameter | 95% HDI
------------------------------------
Intercept | [-5.25, -3.50]
Blueyes | [ 0.19, 0.76]
UVB_z | [-0.43, -0.11]
Ilat_fam_rescaledE2 | [-0.22, 1.29]
# Proportion of samples inside the ROPE [-0.01, 0.01]:
Parameter | inside ROPE
---------------------------------
Intercept | 0.00 %
Blueyes | 0.00 %
UVB_z | 0.00 %
Ilat_fam_rescaledE2 | 0.91 %
Parameter ROPE_low ROPE_high p_ROPE Effects Component
3 b_Intercept -0.01 0.01 0.00000 fixed conditional
1 b_Blueyes -0.01 0.01 0.00025 fixed conditional
4 b_UVB_z -0.01 0.01 0.00058 fixed conditional
2 b_Ilat_fam_rescaledE2 -0.01 0.01 0.00867 fixed conditional
The best model using K-fold explains R2 = 51.67% of the variance. The Root Mean Square Error of the prediction (RMSE) is 2.09.
Adding HG:
Bayes factor: best light model with HG vs best light model without HG: 1.686; anecdotal evidence for the best light model with HG.
elpd_diff se_diff
d_bestlightmodel 0.0 0.0
d_bestlightmodel_1 -3.7 3.5
Subsistence mode should not be added to this model.
UVR must be kept:
Bayes factor: best light model vs best light model without UVB: 36.668; very strong evidence for best light model.
elpd_diff se_diff
d_bestlightmodel 0.0 0.0
d_bestlightmodel_nouvb -3.5 3.2
UVR. UVR has a significant negative impact on daltonism; thus, the higher the UVR incidence on a population, the fewer red-green color-blind males there will be in this population.
Figure 72. UVR incidence vs daltonism rate.
Figure 73. Map of UVR incidence on daltonism rate. A low daltonism rate is a rate below the median of all daltonism rates (3.3), whereas a high daltonism rate codes for rates above 3.3.
Latitude. Latitude is highly correlated with UVR incidence, and its effects on daltonism are fully mediated by it: the closer to the equator a population is, the less likely it is to include red-green color-blind males.
Figure 74. Latitude on daltonism rate.
HG. With the Bayes factor, HG slightly improves the fit.
However, one should note that only 14 hunter-gatherer groups are present in our dataset.
Hunter-gatherer males are less often color-blind, even in places where the UVR incidence is low. We may suggest that selective pressure on color perception is stronger for hunter-gatherers than for agriculture-based communities.
Figure 75. Subsistence strategy on daltonism rate.
Blue. Populations having a high red-green color blindness rate tend to have a specific term for blue in their vocabulary.
According to our hypothesis, Blue would be a proxy for green-blue perception, but there would be no causal relationship between Blue and daltonism.
Figure 76. Having a word for blue on daltonism rate.
Latitude of language family.
It is significant and included in the set of relevant predictors.
Figure 77. Map of latitude of language family origins on daltonism rate.
Figure 78. Location of putative origins of language family on daltonism rate.
We used random forests to estimate the relative importance of our variables in predicting the daltonism rate. Variable importance was computed with Gini-based and permutation-based indexes through the same process as before.
For more information, please see Random forests.
The Root Mean Squared Error (RMSE) for randomForest function is 207.18 and the R-squared explained variance is 32.58.
Figure 79. Variable importance according to randomForest algorithm using the Gini index.
Figure 80. Variable importance according to randomForest algorithm using the accuracy index.
The Root Mean Squared Error (RMSE) for cForest function is 205.73 and the R-squared explained variance is 33.51.
Figure 81. Variable importance according to cforest algorithm using the accuracy index.
Thus, we find that UVR incidence is the most important predictor, but also that latitude and the presence of a blue word matter. Depending on the method used, climate PC1, climate PC2, longitude and subsistence strategy are also included in the set of relevant predictors.
These results are consistent with the results from the regression analysis, except for the inclusion here of climate variables; moreover, the latitude and longitude of language family origins do not appear to be important variables here.
We used simple decision trees as implemented by ctree (Hothorn et al., 2006).
Figure 82. Simple decision tree.
To understand what this decision tree does, we plotted which populations have less/more than -0.337 and -1.315 for z-scored UVR, less/more than 8.275 for log_dist2oceans, and less/more than 0.793 for latitude of language family origin.
Figure 83. UVR (z-scored) as split by the decision tree.
Figure 84. Log of distance to oceans as split by the decision tree.
Figure 85. Latitude of language family origins as split by the decision tree.
Here, UVR is the main and first node in the tree; as suggested by the regression analysis, Blue and subsistence mode play important roles:
Here, we compare the performance of Bayesian mixed-effects models, RandomForest algorithm and Support Vector Machine (SVM). For more information about:
Daltonism is a continuous variable, so we will use 4 measures of regression model accuracy:
Model with all covariates:
MAE and RMSE:
Figure 86. Measures of model regression accuracy (MAE and RMSE), with the model including all covariates.
The unit in this graph is tied to the daltonism scale: as a reminder, daltonism varies between 0 and 10.68 (percentage of color-blind males in the population). The values for MSE, RMSE and MAE in a good model tend to be close to 0.
However, Rsquared is a percentage varying between 0 and 100, where a good model tends to be close to 100:
Rsquared:
Figure 87. Measures of model regression accuracy (R-squared), with the model including all covariates.
Model with only UVR, presence of a blue word, quadratic effect of latitude of putative language family origin:
MAE and RMSE:
Figure 88. Measures of model regression accuracy (MAE and RMSE), with the model including only UVR, Blue, latitude² of language family origins.
Figure 89. Measures of model regression accuracy (R-squared),with the model including only UVR, Blue, latitude² of language family origins.
The performance is very similar for all algorithms, and the explained variance is quite low for the bayesian mixed-effect model, the random forest and the SVM. Thus, a simple bayesian mixed-effect model is enough to capture the explanatory power of our variables for predicting daltonism.
However, our variables do not account for the full variance of the daltonism rate, and the error rate is still quite high.
Note: In previous parts, when we performed the bayesian mixed-effect model and the random forest algorithm, we computed the R-squared and obtained estimates varying between 30 and 45%. However, these estimates were made on the full dataset, and not on the testing set only.
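The difference between full-data and test-set evaluation noted above can be made concrete with a short sketch. This Python example on toy data (the analysis itself was in R) computes the four regression accuracy measures on a held-out set only:

```python
# Sketch: MAE, MSE, RMSE and R-squared computed on a held-out testing set,
# not on the full data used for fitting. Toy data; names are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(142, 4))
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=142)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)              # predictions on unseen data only

mae = mean_absolute_error(y_te, pred)
mse = mean_squared_error(y_te, pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_te, pred)
```

Because the model is scored only on data it never saw, the resulting R-squared is typically lower than the 30-45% full-data estimates mentioned above, which is the point of the comparison.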
Thus:
In this study, I investigated whether differences in eye physiology across populations may help explain differences in the colour lexicon and in the frequency of colour blindness. By performing Point Pattern Analysis, mediation analysis, mixed-effects Bayesian regressions, decision trees/random forests and Support Vector Machines on a set of 142 populations, I explored the relationship between UV incidence and the presence of a specific term for blue on the one hand, and between UV incidence and the frequency of red-green abnormal colour perception in males on the other hand.
Point pattern analysis operates in a specific conceptual framework where points are events with an expected probability of occurrence at any location. However, the location of populations is constrained by many factors. Analysis of spatial randomness and dependence, added to the covariate analysis, highlighted the obvious distribution of languages on Earth: humans avoid inhospitable places and choose to live on continents (as opposed to water). Our dataset is quite representative of the world's populations, although our sample is very rich in Europe and India, whereas other regions like Australia and Papua New Guinea are not as well covered.
We also found that our point attributes (i.e. Blue and daltonism) are not dispersed randomly across populations. Indeed, both spatial autocorrelation and neighbouring-coefficient analysis indicate that the same attributes tend to be located in the same areas.
Furthermore, genetic distance (computed with ultrametric data imputation) results in the highest autocorrelation for colour blindness and the presence of a specific blue term. As inherited red-green colour blindness is due to variation at the level of the DNA sequence (Stelzer et al., 2016), it is tempting to suggest that genetic distance is consistent with the physiological explanation of colour blindness. Indeed, we assume that overall genetic distances reflect population demographic history, which includes the history of the colour-blindness-related DNA sequences.
Yet, the association between genetic distance and the distribution of a word for blue is not straightforward to interpret. As Komarova, Jameson, & Narens (n.d.) suggest in their evolutionary models, colour blindness potentially affects the colour lexicon. Lindsey & Brown (2002) also claimed that abnormal colour perception in the green-blue range would change colour vocabulary, and especially the presence of a specific term for blue. One could also argue that populations with a short genetic distance between each other share a common history and culture, which explains similarities in colour lexicons.
Introduction:
There is a lot of evidence supporting the universalist perspective (Heider & Olivier, 1972; Berlin & Kay, 1991; Regier & Kay, 2009), which proposes that the development of the colour lexicon is very similar across all languages and human groups, due to shared commonalities in eye physiology and cognition. In this thesis, I do not directly challenge this finding, but try to understand how patterns of variation between languages in the presence of a specific term for “blue” could emerge. Contrary to previous universalist work, which assumes that the physiological basis of vision in the eye is the same for all humans, I do not treat eye physiology as a factor driving similarities between the world’s colour vocabularies. Instead, I assume that differences in eye physiology could be responsible for changes in colour perception; in doing so, I treat eye physiology as a factor driving variation across languages.
Summary of our results:
As suggested by Lindsey & Brown (2002), I found that UV incidence is a predictor of Blue. However, other variables in our dataset are also predictive of the presence of a specific term for blue, such as the number of colour categories, which has a strong positive effect. Other cultural (population size and subsistence mode), environmental (distance to lakes) and geographical variables (latitude and longitude of putative language family origins) are also included in the set of relevant predictors. Finally, the mediation analysis showed that the effect of latitude on Blue is fully mediated by UV incidence.
Explanation of our results according to the relativist perspective, and the relative contribution of different factors to the construction of colour categories:
The relativist perspective, as presented in the work of Davidoff, Davies, & Roberson (1999) and Roberson, Davidoff, Davies, & Shapiro (2004), has highlighted the idea that different environments and cultures may be the source of differences between colour lexicons. In this thesis, the addition of other (1) cultural, (2) environmental and (3) genetic variables to the model brings forward their relative contribution to understanding the presence of a specific word for blue.
Cultural. Berlin & Kay (1991) proposed that the development of dyeing techniques would lead societies to have a greater need for additional colour terms. Also, several studies suggest a significant association between population size and cultural complexity (Shennan, 2001; Henrich, 2004; Powell, Shennan, & Thomas, 2009; Premo & Kuhn, 2010; Derex, Beugin, Godelle, & Raymond, 2013). Following those two assumptions, both our variables number of colour categories and population size would be indicators of cultural complexity in our models. Consequently, the important role of the number of colour categories and of population size in our results brings forward cultural complexity as one of the main contributors to differences between colour lexicons in what concerns the presence of a specific word for blue, consistent with Ember’s (1978) findings. Gibson et al. (2017) also emphasized the role of culture in the evolution of the colour lexicon. They suggest that “salient” objects, to which humans need to refer, are usually characterized by their warm colours (red, yellow…), whereas “uninformative” backgrounds are usually cool-coloured (for example blue, green and brown), leading to the under-representation of cool colours in colour lexicons. Because industrialization generates objects distinguishable solely by colour, they postulate that industrialized cultures would have a greater need for additional colour categories, resulting in the appearance of extra categories for “blue”, “brown” and “grey”.
Environmental. Likewise, the distance to lakes would contribute to the presence of a specific term for blue; however, we do not have any explanation about the reasons underlying this finding. According to Goldstone (1998), what is available in the perceptual input affects the development of a particular perceptual skill: thus, it is likely that our physical surroundings could play a role in colour lexicons formation. Though, the lack of predictive strength for the climate variables (which determined if the populations live in a tropical, cool, desertic, or oceanic climates) and other distance to water variables suggest that the effect of our physical surrounding on colour lexicons development is low. It is also possible that we did not include the right measures, but as our variables include a lot of information, it is not very likely to be the case.
Genetic. The genetic distance variables included in the model, which give a partial account of the history of human migrations, did not contribute to the presence of a specific word for blue. However, the multi-dimensional scaling applied to the genetic distance matrices did not achieve a high goodness-of-fit, and we believe the resulting variables did not fully reflect genetic distances. We also know that the history of migrations and contact contributed significantly to colour vocabulary: for example, it is believed that Japanese speakers started to separate blue and green after educational materials were imported from the Allies following World War II, and after coloured pencils were imported to Japan (Bhatia, 2018).
Those findings are fully compatible with our first hypothesis, namely that UV incidence drives changes in the colour lexicon. Indeed, we assume that the observed variation in the colour lexicons of the world's languages does not depend on a single underlying factor. In this thesis, I find UV incidence to be a significant predictor of the presence of a specific blue term, even after controlling for various ecological and cultural confounds. Thus, even though UV incidence is not the main factor determining the presence or absence of a specific term, this analysis supports the idea that UV incidence makes its own contribution, among other predictors, to the structure of the colour vocabulary.
Causal pattern:
The pattern found suggests that there may be a causal relationship going from UV incidence to Blue, the inverse causal relationship being very unlikely. The chain of causal events in Illustration 6, proposed by Lindsey & Brown (2002), is a possible explanation for this relationship, even if I do not exclude the possibility that UV incidence affects the presence of a specific blue term through other causal paths. For example, one can postulate that UV incidence partially shapes food production systems, thus affecting economic development; however, I believe that our climate variables would better capture this sort of causal assumption. UV incidence and Blue could also be proxies for other variables, but this is less likely since we controlled for several cultural and environmental variables. Thus, the causal assumptions linking UV incidence to macular pigment density, and macular pigment density to abnormal colour perception in the green-blue range, remain, in our view, the most probable explanation for the potential effect of UV incidence on colour naming.
Role of physiology:
Therefore, some physical variables, such as UV incidence, could have an indirect effect on semantic categories through their effects on physiology. Geographical location, which partly determines the amount of UV radiation reaching the eye, would also contribute indirectly to colour naming through its mediating relationship with UV incidence and, consequently, physiology.
Summary of results:
Using the same methods, we found that UV incidence is strongly negatively related to the population frequency of red-green abnormal colour perception, even after controlling for various ecological and cultural confounds. UV incidence is the strongest predictor according to both random forests and Bayesian models (if latitude is excluded). Indeed, the effect of latitude on daltonism is fully mediated by UV incidence.
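The claim that latitude's effect on daltonism is fully mediated by UV incidence can be illustrated with a toy simulation. The actual analysis used R (the mediation package of Tingley et al., 2014); this Python sketch uses synthetic data with made-up coefficients, purely to show what "full mediation" means: once the mediator (UV) is in the model, the direct effect of latitude shrinks to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Toy data-generating process (hypothetical coefficients, not fitted values):
# UV incidence depends on |latitude|, and daltonism depends on UV only,
# so the effect of latitude is fully mediated by construction.
abs_lat = np.abs(rng.uniform(-60, 60, n))
uv = 100 - 0.8 * abs_lat + rng.normal(0, 2, n)
daltonism = 0.05 * uv + rng.normal(0, 0.5, n)

def ols(y, *xs):
    """OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *xs])
    return np.linalg.lstsq(X, y, rcond=None)[0]

total = ols(daltonism, abs_lat)[1]        # total effect of |latitude|
direct = ols(daltonism, abs_lat, uv)[1]   # direct effect, UV controlled

# Under full mediation, total is close to 0.05 * (-0.8) = -0.04 and direct is close to 0.
print(f"total={total:.3f}, direct={direct:.3f}")
```

This is only the classic regression-based view of mediation; the Bayesian models in the thesis are considerably richer.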
Causal pattern:
The strength of UV incidence as a predictive factor for daltonism may suggest, at face value, that UV incidence plays a role in determining the male red-green colour blindness rate. If UV incidence and daltonism are not proxies for other unmeasured variables, then a possible causal relationship may go from UV incidence to red-green colour blindness: UV radiation can cause DNA mutations (Pfeifer, You, & Besaratinia, 2005), skin cancer being one example, but the most common colour blindness conditions are inherited, and there is no evidence that UV incidence causes DNA mutations in gametes. Moreover, the red-green colour blindness mutation is present only in places where the UV incidence rate is low. Therefore, UV incidence cannot directly explain mutations in the opsin genes. The chain of causality proposed by Brown & Lindsey (2004) and summarized in Illustration 6 then becomes a much more probable explanation for the strong impact of UV incidence on daltonism. However, I do not set aside the possibility that UV incidence and daltonism are linked through other causal paths.
Together, these results put forward the role of physical factors in changing the frequency of genetic changes by generating different selective pressures in different populations. Like colour vocabulary, colour blindness can be linked to geographical location through its mediating relationship with UV incidence and eye physiology (see Illustration 6).
Relative contribution of other variables:
Subsistence strategy. Other variables also play a significant role in predicting daltonism. Consistent with our hypothesis, subsistence strategy is slightly associated with it, hunter-gatherers being less likely to show red-green abnormal colour perception. We suggest that the selective pressure is higher for hunter-gatherers than for populations relying on food production. Interestingly, hunter-gatherers in our dataset have a fairly low colour-blindness rate irrespective of UV incidence, meaning that a selective pressure on colour perception may indeed exist. However, even though we attempted to build a representative sample of the world's languages, there were few hunter-gatherer communities, making it difficult to draw conclusions about their potential effect in our analysis.
Blue. Having a dedicated word for blue is significantly positively related to the frequency of red-green abnormal colour perception: populations with a high red-green colour blindness rate tend to have a specific term for blue in their vocabulary. We suggest that this is, in fact, a proxy for high levels of acquired abnormal blue-green colour perception (see 1st hypothesis and Illustration 9). Other explanations, not linked to our hypothesis, are also possible: for example, red-green colour blindness may have a direct impact on colour categorisation.
Our hypotheses did not posit a causal effect going from red-green colour blindness to the presence of a blue word in colour lexicons, but only a correlation between the two, as both would be linked to abnormal blue-green colour perception. Yet, we believe that inter-individual variation in abnormal colour perception (such as inherited red-green colour blindness) could also lead to a reorganization of colour semantics.
Mutations in the opsin genes, causing deuteranopia, deuteranomaly, protanopia or protanomaly, are known to be responsible for the presence of atypical cone cells, thus affecting eye physiology. Due to a lack of information about the population frequencies of the specific variants involved in abnormal colour perception, I have no evidence that mutations in the opsin genes contribute indirectly to colour lexicons through their effect on eye physiology. However, our results suggest that a perceptual defect in the blue-green colour dimension contributes to the development of the colour lexicon. As genetic mutations in the opsin genes are known to be responsible for the presence of red-green colour blind individuals, one may wonder whether genetics could remodel colour categories at a larger scale. Computer simulations (Komarova et al., n.d.) suggest that the presence of colour blind agents in a population can change the number of colour categories and/or the location of colour category boundaries.
Real-world data do not provide any evidence for colour lexicon differences in populations where deuteranomaly rates are high; however, places with high red-green abnormal colour perception rates coincide with places where dyeing systems are highly developed, making it difficult to draw conclusions.
The causal relationships suggested by the results of this analysis can be represented graphically as shown in the figure below:
However, our analysis has several shortcomings.
Genetic distance matrices. First, the genetic distance matrices computed from ALFRED rely on only 478 microhaplotypes and SNPs and are therefore not representative of the whole human genome. However, the SNPs selected were extracted from sets of AISNPs (Ancestry Informative SNPs), loci that are very informative about population history and migration. In addition, the matching between populations in our dataset and populations in ALFRED is imperfect (see Annexe Genetic distance matrix, section Step 4: Link populations from FST table to our dataset's population, for more information). Finally, there are no allele frequency data for the opsin genes, which are responsible for colour perception.
Location of boundaries in colour categorization. Second, our current database is not sufficient to test the claims put forward in computational studies, namely that abnormal colour perception can lead to a redefinition of colour category boundaries, as the exact locations of category boundaries are not present in our dataset.
Index of level of exposure to dyeing techniques. Third, this study lacks an index of population exposure to manufactured goods and dyeing techniques. We hoped that some of these distinctions would be reflected in subsistence mode, hunter-gatherer populations having less access to dyed products, and in population size, populations larger than a few hundred thousand people being more likely to have access to dyeing techniques. Country-level development indexes were not relevant in our case, as we mainly studied specific populations within countries. More generally, development indexes for specific populations or isolated communities are largely outdated and can raise ethical debates, and a dedicated measure of exposure to dyeing technology is too specific to be found in anthropological databases.
Unique location of each population. Fourth, the different measures of distance to water (distance to rivers, distance to oceans, distance to lakes), as well as UV incidence, could only be computed from a fixed latitude and longitude for each population. However, populations of more than a few hundred people can occupy a large territory. We therefore regard those variables as rough estimates (order of magnitude) of how far populations live from lakes, rivers or oceans, not as precise indications.
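To see why a single reference point only yields an order-of-magnitude estimate, consider the great-circle distance between two coordinates, computed here with the haversine formula (a sketch assuming a spherical Earth of radius 6371 km; the thesis does not specify which distance formula was used):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees,
    assuming a spherical Earth of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

# One degree of latitude is roughly 111 km, so a territory spanning even
# one degree shifts any distance-to-water value by about a hundred km.
print(round(haversine_km(0.0, 0.0, 1.0, 0.0)))  # 111
```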
Colour blindness test. Finally, our measure of daltonism may be imperfect. Even though the tests used are standardized, the measure is subject to variation across experimenters, contexts and locations. However, among the numerous populations for which colour blindness data exist, meticulous and rigorous work was done (Meeussen, 2015) to retain only populations where the data were of sufficient quality.
We brought some evidence that real-world data are consistent with the hypothesis that both red-green colour blindness and the presence of a specific term for blue are partly determined by UV incidence. Colour vocabulary is directly influenced by environmental and cultural factors, but would also be indirectly influenced through their mediating relationship with human physiology. Physiology, which has already been proposed to shape the structure of languages' phoneme inventories (Moisik & Dediu, 2017; Dediu, Janssen, & Moisik, 2019; Blasi et al., 2019), is then likely to also affect languages' semantic categories. Likewise, physiology would be a mediator between UV incidence and inherited red-green abnormal colour perception.
Together, our findings suggest that individual-level effects (here, an acquired defect in the blue-green colour range) can reshape both colour lexicons and the distribution of inherited red-green abnormal colour perception. Understanding to what extent trichromacy is positively selected depending on the subsistence strategy would add to our findings concerning the worldwide distribution of abnormal red-green colour blindness.
It would also be interesting to investigate to what extent other individual effects, such as the presence of colour-blind individuals in a population, could influence the development of colour categories. Specifically, it would be interesting to study more closely the evolution of the colour lexicon on Pingelap island. Rare cases of tetrachromatic individuals (Deeb, 2005), if they spread in a population, would also be a relevant field of investigation for colour lexicon change.
In a cross-linguistic study, Malt & Majid (2013) provide evidence for a great diversity in how languages label categories. Both universalist and relativist factors can explain the pattern of semantic categories seen across the world; studying their effects through the mediating role of physiology provides relevant leads for further exploration.
Definitions of terms:
The Fixation Index (FST) is an index measuring genetic distance between populations. It quantifies the degree of population differentiation by comparing allele frequencies between populations in pairwise comparisons. High FST values indicate strong population differentiation, whereas low values indicate a high rate of interbreeding between populations.
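As an illustration of the definition, Wright's FST for a single biallelic locus can be computed directly from allele frequencies. This is only a sketch assuming two populations of equal size; it is not the estimator Arlequin applies, which works from average pairwise differences:

```python
def fst_two_pops(p1, p2):
    """Wright's FST for a single biallelic locus, from the allele
    frequencies p1 and p2 in two equal-sized populations:
    FST = (HT - HS) / HT, where HS is the mean within-population
    heterozygosity and HT the heterozygosity of the pooled population."""
    p_bar = (p1 + p2) / 2.0
    hs = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2.0
    ht = 2 * p_bar * (1 - p_bar)
    return 0.0 if ht == 0 else (ht - hs) / ht

print(fst_two_pops(0.5, 0.5))  # 0.0  (identical frequencies: no differentiation)
print(fst_two_pops(1.0, 0.0))  # 1.0  (fixed for different alleles)
```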
Arlequin (Excoffier & Lischer, 2010) is a software package for population genetics data analysis, which was used here to compute FST from the ALFRED database. Its goal is to provide users with a large set of basic methods and statistical tests to extract information on the genetic and demographic features of a collection of population samples.
The following files were extracted from the ALFRED database and FROG-kb (Rajeevan, 2003):
| Panel | Allele frequency file | Population samples | SNPs | Microhaps |
|---|---|---|---|---|
| Microhaplotypes | Microhap_alleleF_198.txt | 96 | 0 | 198 |
| Seldin’s list of 128 AISNPs | Seldin128_alleleF.txt | 70 | 128 | 0 |
| SNPforID 34-plex | SNPForId34_alleleF.txt | 53 | 34 | 0 |
| KiddLab - Set of 55 AISNPs | KiddLab55_alleleF.txt | 139 | 55 | 0 |
| Kayser’s set of 24 Ancestry Informative Markers | Kayser24_alleleF.txt | 73 | 24 | 0 |
| Daniele Podini’s list of 32 AISNPs | Podini32_alleleF.txt | 111 | 29 | 0 |
| Eurasiaplex 23 SNP Panel | Eurasiaplex23_alleleF.txt | 76 | 23 | 0 |
| Nievergelt’s Set of 41 AIMs | Nievergelt41_alleleF.txt | 123 | 41 | 0 |
| Overlap set of AISNPs | Overlap44_alleleF.txt | 72 | 44 | 0 |
| Li’s panel of 74 AIMs | Li74_alleleF.txt | 67 | 73 | 0 |
The different text files were gathered into a single text file and transformed into Arlequin files (.arp) with an R script (see github mathjoss/Alfred2FST/alfredtxt2arlequin.R). This script creates one .arp file for each micro-haplotype and each SNP.
These .arp files were then fed to Arlequin (Excoffier & Lischer, 2010) (number of permutations = 110). All Arlequin parameters used for the analysis can be found on github: mathjoss/Alfred2FST/arl_run.ars. Arlequin outputs one results folder per input file.
We focused on the table called ‘Population average pairwise difference’, which contains the average number of pairwise differences between populations above the diagonal, the within-population values on the diagonal, and the corrected average pairwise differences below the diagonal.
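The layout of that table, and the extraction of only the below-diagonal (corrected) values into a symmetric distance matrix, can be sketched with a small hypothetical example:

```python
import numpy as np

# Hypothetical Arlequin-style output: average pairwise differences above
# the diagonal, within-population values on the diagonal, and corrected
# average pairwise differences below the diagonal.
tab = np.array([[1.0, 3.2, 4.1],
                [2.7, 1.4, 3.8],
                [3.5, 3.1, 0.9]])

# Keep only the below-diagonal (corrected) values and mirror them into a
# symmetric distance matrix with zeros on the diagonal.
low = np.tril(tab, k=-1)
dist = low + low.T
print(dist[0, 1], dist[1, 0])  # 2.7 2.7
```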
Those tables were averaged into a single table with two other R scripts (see github mathjoss/Alfred2FST/mean_all_files.R and mathjoss/Alfred2FST/mean_all_files_part2.R). If a population was absent from a specific micro-haplotype or SNP, the mean was computed with missing values present.
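Assuming the per-cell mean simply ignores loci for which a population is missing (one plausible reading of the averaging step, which was actually done in R), the operation can be sketched as:

```python
import numpy as np

# Hypothetical per-locus distance matrices for three populations; np.nan
# marks a population absent from that micro-haplotype or SNP.
d_locus1 = np.array([[0.0, 0.2, 0.5],
                     [0.2, 0.0, 0.4],
                     [0.5, 0.4, 0.0]])
d_locus2 = np.array([[0.0,    0.4,    np.nan],
                     [0.4,    0.0,    np.nan],
                     [np.nan, np.nan, 0.0]])

# Element-wise mean over loci, ignoring missing cells: a cell stays NaN
# only if it is missing in every locus matrix.
mean_dist = np.nanmean(np.stack([d_locus1, d_locus2]), axis=0)
print(round(mean_dist[0, 1], 3))  # 0.3 (both loci contribute)
print(mean_dist[0, 2])            # 0.5 (only locus 1 contributes)
```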
We thus obtained a population average pairwise difference distance matrix based on:
Then, additive and ultrametric methods (De Soete, 1984; Lapointe & Kirsch, 1995) were used to impute the missing values of the final distance matrix. For more details about the exact algorithm, see «A weighted least-squares approach for inferring phylogenies from incomplete distance matrices» (Makarenkov, 2004). The R code for this process is available on github: mathjoss/Alfred2FST/extract_MDS.R. This file was also used to perform Multi-Dimensional Scaling (cmdscale).
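R's cmdscale implements classical (Torgerson) MDS. A minimal re-implementation, with the goodness-of-fit defined as the share of positive eigenvalue mass retained by the kept axes (one of the two GOF criteria cmdscale reports), might look like this sketch:

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS, the method behind R's cmdscale:
    double-centre the squared distance matrix, eigendecompose, and keep
    the top-k axes. The goodness-of-fit is the share of positive
    eigenvalue mass captured by those k axes."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centred matrix
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1]              # sort descending
    evals, evecs = evals[order], evecs[:, order]
    pos = np.maximum(evals, 0)                   # clip numerical negatives
    coords = evecs[:, :k] * np.sqrt(pos[:k])
    gof = pos[:k].sum() / pos.sum()
    return coords, gof

# Four points on a line: the distances are exactly Euclidean in one
# dimension, so a single axis gives a perfect fit (gof = 1).
pts = np.array([0.0, 1.0, 3.0, 6.0])
D = np.abs(pts[:, None] - pts[None, :])
coords, gof = classical_mds(D, k=1)
print(round(gof, 3))  # 1.0
```

A low goodness-of-fit here, as reported for the genetic distance matrices, means the retained axes reproduce the original distances poorly.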
The 145 populations in the FST table differ substantially from the 142 populations present in our dataset. Furthermore, some categories in the ALFRED database are broad and unclear (African Americans, Indian Mixed, ...) and did not match our populations. Consequently, the following procedure was applied:
Results from (1) and (2) are gathered in the “genetic_incomplete” column in Database_final4.csv. Results from (1), (2) and (3) are gathered in the “genetic_full” column in Database_final4.csv. This column has been used for further analysis.
Please note that, in the previous computations, only the below-diagonal elements were used, corresponding to the corrected average pairwise differences.
Baddeley, A., Rubak, E., & Turner, R. (2015). Spatial point patterns: Methodology and applications with R. Retrieved from http://www.crcpress.com/Spatial-Point-Patterns-Methodology-and-Applications-with-R/Baddeley-Rubak-Turner/9781482210200/
Bentz, C., Dediu, D., Verkerk, A., & Jäger, G. (2018). The evolution of language families is shaped by the environment beyond neutral drift. Nature Human Behaviour, 2(11), 816. https://doi.org/10.1038/s41562-018-0457-6
Berlin, B., & Kay, P. (1991). Basic Color Terms: Their Universality and Evolution. University of California Press.
Bhatia, A. (2018). The crayola-fication of the world: How we gave colour names, and it messed with our brain (part I). Retrieved from http://www.empiricalzeal.com/2012/06/05/the-crayola-fication-of-the-world-how-we-gave-colors-names-and-it-messed-with-our-brains-part-i/
Bickel, B., Nichols, J., Zakharko, T., Witzlack-Makarevich, A., Hildebrandt, K., Rießler, M., … Lowe, J. B. (2017). The AUTOTYP typological databases (Version 0.1.0) [GitHub database]. Retrieved from https://github.com/autotyp/autotyp-data/tree/0.1.0
Blasi, D. E., Moran, S., Moisik, S. R., Widmer, P., Dediu, D., & Bickel, B. (2019). Human sound systems are shaped by post-Neolithic changes in bite configuration. Science, 363(6432), eaav3218. https://doi.org/10.1126/science.aav3218
Brown, A. M., & Lindsey, D. T. (2004). Color and language: Worldwide distribution of Daltonism and distinct words for "blue". Visual Neuroscience, 21(3), 409–412. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/15518222
Bürkner, P.-C. (2018a). Advanced Bayesian Multilevel Modeling with the R Package brms. The R Journal, 10(1), 395–411. https://doi.org/10.32614/RJ-2018-017
Bürkner, P.-C. (2018b). Advanced Bayesian multilevel modeling with the R package brms. The R Journal, 10(1), 395–411. https://doi.org/10.32614/RJ-2018-017
Cysouw, M., Dediu, D., & Moran, S. (2012). Comment on “Phonemic Diversity Supports a Serial Founder Effect Model of Language Expansion from Africa”. Science, 335(6069), 657. https://doi.org/10.1126/science.1208841
Davidoff, J., Davies, I., & Roberson, D. (1999). Colour categories in a stone-age tribe. Nature, 398(6724), 203–204. https://doi.org/10.1038/18335
Dediu, D., Janssen, R., & Moisik, S. R. (2019). Weak biases emerging from vocal tract anatomy shape the repeated transmission of vowels. Nature Human Behaviour, 3(10), 1107–1115. https://doi.org/10.1038/s41562-019-0663-x
Deeb, S. (2005). The molecular basis of variation in human color vision: Variation in human color vision. Clinical Genetics, 67(5), 369–377. https://doi.org/10.1111/j.1399-0004.2004.00343.x
Derex, M., Beugin, M.-P., Godelle, B., & Raymond, M. (2013). Experimental evidence for the influence of group size on cultural complexity. Nature, 503(7476), 389–391. https://doi.org/10.1038/nature12774
De Soete, G. (1984). Ultrametric tree representations of incomplete dissimilarity data. Journal of Classification, 1(1), 235–242. https://doi.org/10.1007/BF01890124
Ember, M. (1978). Size of Color Lexicon: Interaction of Cultural and Biological Factors. American Anthropologist, 80(2), 364–367. https://doi.org/10.1525/aa.1978.80.2.02a00100
Everett, C. (2013). Evidence for Direct Geographic Influences on Linguistic Sounds: The Case of Ejectives. PLoS ONE, 8(6), e65275. https://doi.org/10.1371/journal.pone.0065275
Excoffier, L., & Lischer, H. E. L. (2010). Arlequin suite ver 3.5: A new series of programs to perform population genetics analyses under Linux and Windows. Molecular Ecology Resources, 10(3), 564–567. https://doi.org/10.1111/j.1755-0998.2010.02847.x
Gibson, E., Futrell, R., Jara-Ettinger, J., Mahowald, K., Bergen, L., Ratnasingam, S., … Conway, B. R. (2017). Color naming across languages reflects color use. Proceedings of the National Academy of Sciences, 114(40), 10785–10790. https://doi.org/10.1073/pnas.1619666114
Goldstone, R. L. (1998). Perceptual learning. Annual Review of Psychology, 49(1), 585–612. https://doi.org/10.1146/annurev.psych.49.1.585
Hammarström, H., Bank, S., Forkel, R., & Haspelmath, M. (2018). Glottolog 3.2. Retrieved from http://glottolog.org
Hardy, L., Rand, G., & Rittler, M. (1954). HRR polychromatic plates. JOSA, 44(7), 509–521.
Heider, E. R., & Olivier, D. C. (1972). The structure of the color space in naming and memory for two languages. Cognitive Psychology, 3(2), 337–354. https://doi.org/10.1016/0010-0285(72)90011-4
Henrich, J. (2004). Demography and Cultural Evolution: How Adaptive Cultural Processes Can Produce Maladaptive Losses—The Tasmanian Case. American Antiquity, 69(2), 197–214. https://doi.org/10.2307/4128416
Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651–674. https://doi.org/10.1198/106186006X133933
Ishihara, S. (1917). Tests for color-blindness. Tokyo: Hongo Harukicho.
Kirby, K. R., Gray, R. D., Greenhill, S. J., Jordan, F. M., Gomes-Ng, S., Bibiko, H.-J., … Gavin, M. C. (2016). D-PLACE: A Global Database of Cultural, Linguistic and Environmental Diversity. PLOS ONE, 11(7), e0158391. https://doi.org/10.1371/journal.pone.0158391
Knoll, H. A. (1968). Patent No. US3382025A. Retrieved from https://patents.google.com/patent/US3382025A/en
Komarova, N. L., Jameson, K. A., & Narens, L. (n.d.). Evolutionary Models of Color Categorization Based on Discrimination. 46.
Lapointe, F. J., & Kirsch, J. A. W. (1995). Estimating Phylogenies from Lacunose Distance Matrices, with Special Reference to DNA Hybridization Data. Molecular Biology and Evolution, 12(2), 266. https://doi.org/10.1093/oxfordjournals.molbev.a040209
Lindsey, D. T., & Brown, A. M. (2002). Color naming and the phototoxic effects of sunlight on the eye. Psychological Science, 13(6), 506–512.
Malt, B. C., & Majid, A. (2013). How thought is mapped into words: How thought is mapped into words. Wiley Interdisciplinary Reviews: Cognitive Science, 4(6), 583–597. https://doi.org/10.1002/wcs.1251
Meeussen, E. (2015). Colour blindness and its contribution to colour vocabulary (Master’s thesis). Radboud Universiteit Nijmegen, Nijmegen, The Netherlands.
Moisik, S. R., & Dediu, D. (2017). Anatomical biasing and clicks: Evidence from biomechanical modeling. Journal of Language Evolution, 2(1), 37–51. https://doi.org/10.1093/jole/lzx004
Moran, P. A. P. (1950). Notes on Continuous Stochastic Phenomena. Biometrika, 37(1/2), 17–23. https://doi.org/10.2307/2332142
Pfeifer, G. P., You, Y.-H., & Besaratinia, A. (2005). Mutations induced by ultraviolet light. Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis, 571(1-2), 19–31. https://doi.org/10.1016/j.mrfmmm.2004.06.057
Powell, A., Shennan, S., & Thomas, M. G. (2009). Late Pleistocene Demography and the Appearance of Modern Human Behavior. Science, 324(5932), 1298–1301. https://doi.org/10.1126/science.1170165
Premo, L. S., & Kuhn, S. L. (2010). Modeling Effects of Local Extinctions on Culture Change and Diversity in the Paleolithic. PLoS ONE, 5(12), e15582. https://doi.org/10.1371/journal.pone.0015582
Rajeevan, H. (2003). ALFRED: The ALelle FREquency Database. Update. Nucleic Acids Research, 31(1), 270–271. https://doi.org/10.1093/nar/gkg043
R Core Team. (2020). R: A language and environment for statistical computing. Retrieved from https://www.R-project.org/
Regier, T., & Kay, P. (2009). Language, thought, and color: Whorf was half right. Trends in Cognitive Sciences, 13(10), 439–446. https://doi.org/10.1016/j.tics.2009.07.001
Roberson, D., Davidoff, J., Davies, I. R. L., & Shapiro, L. R. (2004). The Development of Color Categories in Two Languages: A Longitudinal Study. Journal of Experimental Psychology: General, 133(4), 554–571. https://doi.org/10.1037/0096-3445.133.4.554
Shennan, S. (2001). Demography and Cultural Innovation: A Model and its Implications for the Emergence of Modern Human Culture. Cambridge Archaeological Journal, 11(1), 5–16. https://doi.org/10.1017/S0959774301000014
Spielman, S. E. (2017). Point Pattern Analysis. In International Encyclopedia of Geography (pp. 1–9). https://doi.org/10.1002/9781118786352.wbieg0849
Stelzer, G., Rosen, N., Plaschkes, I., Zimmerman, S., Twik, M., Fishilevich, S., … Lancet, D. (2016). The GeneCards Suite: From Gene Data Mining to Disease Genome Sequence Analyses. Current Protocols in Bioinformatics, 54(1). https://doi.org/10.1002/cpbi.5
Thomson, W. (1880). An instrument for the detection of color blindness. Transactions of the American Ophthalmological Society, 3, 142.
Tingley, D., Yamamoto, T., Hirose, K., Keele, L., & Imai, K. (2014). Mediation: R Package for Causal Mediation Analysis. Journal of Statistical Software, 59(1), 1–38. https://doi.org/10.18637/jss.v059.i05
Turchin, P., Brennan, R., Currie, T., Feeney, K., Francois, P., Hoyer, D., … Whitehouse, H. (2015). Seshat: The Global History Databank. Cliodynamics, 6(1). https://doi.org/10.21237/C7clio6127917
Wichmann, S., Müller, A., & Velupillai, V. (2010). Homelands of the world’s language families: A quantitative approach. Diachronica, 27(2), 247–276. https://doi.org/10.1075/dia.27.2.05wic